VLSI Implementation of Digital Fourier Transforms


1.  OVERVIEW 

In  the  late  1970's  a  modular,  high-throughput  architecture  for  large  scale 
Fourier  Transform  processors  was  developed  in  [1,2].  This  architecture  uses 
only  a  few  basic  modules  in  a  highly  pipelined  arrangement  and  some  serial 
memory  for  temporary  storage  of  operands.  This  streamlined  architecture 
seemed  predestined  for  implementation  with  "Charge  Transfer  Devices",  which 
have  proven  themselves  in  many  high-speed  signal  processing  applications  and 
for serial memory [3]. Thus we proposed to investigate the use of
charge-coupled devices (CCDs) in the implementation of pipelined FFT processors. For
various reasons which are explained below, the use of CCDs was dropped at an
early stage and the decision was made to design these same modules with
standard silicon-gate NMOS technology.

A collection of the basic modules used in these FFT architectures has been
designed  and  implemented;  some  are  currently  in  fabrication.  These  modules 
are: 

1. a specialized π/8 vector rotator which uses the rational approximation
algorithm developed by Despain in [2],

2. a module to perform a multiplication by √3, also developed in [2],

3.  a  general  CORDIC  rotator,  capable  of  rotating  a  complex  vector  by  any 
angle  with  16-bit  precision, 

4,5.  two  modules  capable  of  performing  the  "butterfly"  operation  of  the  Fast 
Fourier  Transform  (FFT), 

6.  a  barrel  shifter  which  was  designed  for  use  in  an  iterative  CORDIC  module. 

We  also  expanded  the  scope  of  our  research  to  study  the  more  general 
problem  of  efficiently  computing  the  Discrete  Fourier  Transform  with  proper 
attention to the constraints of VLSI. Many of the results are applicable to most
VLSI technologies (N- or P-channel MOS, bulk CMOS, SOS CMOS, I²L). New
techniques were developed for the construction of large scale FFT processors which
are geared toward the use of VLSI. These techniques employ the traditional
Cooley-Tukey  version  of  the  FFT  [4]  as  well  as  the  prime-factor  algorithm  of 
Good  [5]  and  elements  of  the  Winograd  Fourier  Transform  [6]. 

At  a  lower  level,  we  developed  new  results  on  minimum  latency  adders 
which  would  be  useful  in  the  design  of  Fourier  Transform  processors  using  the 
CORDIC  or  rational  approximation  rotation  algorithms.  On  the  theoretical  side, 
studies of the computational complexity of various Fourier Transform
algorithms were made using a VLSI model developed by Thompson [7].

In  this  report  we  will  first  review  the  class  of  highly  pipelined  architectures 
for  FFT  processors  which  are  considered  for  implementation  with  VLSI  (sect  2  & 
3).  We  then  review  the  charge-transfer  techniques  and  the  more  classical  MOS 
technologies  and  discuss  the  trade-offs  for  the  implementation  of  the  envisioned 
FFT  processor  architectures  (sect  4).  In  section  5  we  present  a  detailed 
description  of  the  hardware  modules  that  were  designed  and  implemented  in 
NMOS technology and make some performance predictions for a complete
system. In section 6 we present new theoretical results concerning minimum
latency adders which could be used in FFT processors where minimum latency
is a design goal. We also give results from the application of complexity
theory which produce some absolute bounds on area and time for
implementations of FFT processors in VLSI.


2. INTRODUCTION

2.1. Fast Fourier Transform background

The  Discrete  Fourier  Transform  (DFT),  widely  used  in  many  areas  of  signal 
processing,  can  be  expressed  as 

$$A_r = \sum_{k=0}^{N-1} B_k \, e^{-j 2\pi r k / N}, \qquad r = 0, 1, \ldots, N-1 \tag{1}$$

This  computation  is  generally  used  to  transform  the  representation  of  a  set  of 
data  samples  from  the  time  or  space  domain  into  the  frequency  domain.  The 
Cooley-Tukey FFT [4] is a factorization of this equation which reduces the
number of multiplications involved from $O(N^2)$ to $O(N \log_2 N)$, assuming that the
radix-2 algorithm is used. This method of computing (1) consists of "butterfly"
operations of the form

C  =  A+B  (2) 

D  =  A-B 

and  multiplications  by  various  roots  of  unity.  The  Cooley-Tukey  algorithm  has  a 
great  deal  of  regularity  which  can  be  exploited  in  a  VLSI  implementation. 
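The difference between evaluating (1) directly and factoring it into butterflies can be sketched as follows. This is a minimal illustration in Python, not code from the report: it assumes the negative-exponent sign convention for (1), and the function names `dft` and `fft_radix2` are hypothetical.

```python
import cmath

def dft(b):
    """Direct evaluation of the DFT sum: about N^2 complex multiplications."""
    n = len(b)
    return [sum(b[k] * cmath.exp(-2j * cmath.pi * r * k / n) for k in range(n))
            for r in range(n)]

def fft_radix2(b):
    """Recursive radix-2 FFT; len(b) must be a power of two."""
    n = len(b)
    if n == 1:
        return list(b)
    even = fft_radix2(b[0::2])   # transform of the even-indexed samples
    odd = fft_radix2(b[1::2])    # transform of the odd-indexed samples
    out = [0j] * n
    for r in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * r / n) * odd[r]   # multiply by a root of unity
        out[r] = even[r] + t             # butterfly sum,        C = A + B
        out[r + n // 2] = even[r] - t    # butterfly difference, D = A - B
    return out

samples = [1, 2, 3, 4, 0, -1, -2, -3]
max_error = max(abs(x - y) for x, y in zip(dft(samples), fft_radix2(samples)))
```

Both routines return the same spectrum up to rounding; only the number of multiplications differs.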

Other techniques developed by Good [5] and Winograd [6] can be utilized to
reduce the number of multiplications required to compute (1) to O(N) for
certain values of N, although these techniques in general require an increase in the
complexity of the interconnections involved. The reduction in multiplications is
achieved by expressing small transforms as convolutions, utilizing fast
algorithms to perform these convolutions, and by building up large transforms out of
these smaller, relatively prime modules.
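How a large transform is assembled from relatively prime small ones can be illustrated with the index mapping of the prime-factor algorithm. The sketch below is a hedged Python illustration for N = 3 x 5. The helper names are hypothetical, the small transforms are computed by the direct sum rather than by fast convolution, and only the mapping itself is meant to reflect [5].

```python
import cmath
from math import gcd

def dft(x):
    """Direct DFT, used here for the small transforms and as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def prime_factor_dft(x, n1, n2):
    """Length n1*n2 DFT from length-n1 and length-n2 DFTs, with no twiddle factors."""
    n = n1 * n2
    assert len(x) == n and gcd(n1, n2) == 1
    # Input map: sample (n2*i1 + n1*i2) mod n is placed in cell [i1][i2].
    grid = [[x[(n2 * i1 + n1 * i2) % n] for i2 in range(n2)] for i1 in range(n1)]
    # Length-n2 transform along every row.
    grid = [dft(row) for row in grid]
    # Length-n1 transform along every column; cols[k2][k1] is the 2-D result.
    cols = [dft([grid[i1][k2] for i1 in range(n1)]) for k2 in range(n2)]
    # Output map (Chinese remainder theorem): output k sits at k1 = k mod n1, k2 = k mod n2.
    return [cols[k % n2][k % n1] for k in range(n)]

signal = [float((3 * i * i + 1) % 7) for i in range(15)]
pfa_error = max(abs(a - b) for a, b in zip(dft(signal), prime_factor_dft(signal, 3, 5)))
```

The absence of twiddle multiplications between the two passes is what the relatively prime factorization buys; the price is the less regular input and output index maps.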

While most of the Discrete Fourier Transform algorithms which have been
developed have been for existing general-purpose machines, tremendous
speed-ups are possible through the development of algorithms and hardware
simultaneously. Despain [1] pointed out that the complex multiplications in (1) are
actually vector rotations and can thus be computed using an arithmetic
technique known as the CORDIC algorithm, originally developed by Volder [8].
Depending on the factors of N, the transform size, computational savings can
also be realized through the use of rational approximations for rotations as
described in [2].
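The idea of replacing a complex multiplication by a vector rotation can be sketched with a textbook CORDIC iteration. This is a floating-point illustration only: a hardware rotator would use fixed-point shifts and adds, and the function name, the 16-iteration default and the final gain division are assumptions of this sketch rather than details of the modules described in this report.

```python
import math

def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) by `angle` radians, |angle| <= pi/2, using shift-and-add style steps."""
    # Elementary rotation angles atan(2^-i).
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector; the total gain is a known constant.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    z = angle                      # remaining angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        # In hardware, the multiplications by 2^-i are wired shifts.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x / gain, y / gain

cx, cy = cordic_rotate(1.0, 0.0, math.pi / 6)
```

After 16 iterations the residual angle is bounded by atan(2^-15), so the result agrees with cos and sin of the requested angle to roughly four decimal places.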


2.2. FFT Architectures


Only  a  few  basic  functions  are  needed  to  implement  a  wide  set  of  Cooley- 
Tukey  type  FFT  algorithms  of  a  given  transform  length  and  transform  radix. 
These  are: 

1.  A  butterfly  module  which  performs  the  operation  in  (2)  above. 

2. A CORDIC rotator module which performs the calculation of $B e^{j\theta}$ in (1).

3.  Shift  register  memories  for  intermediate  storage  of  data. 

The pipeline structure shown in figure 1 is derived in [1], and consists only
of the three modules listed above. Basically, the operation of the processor is as
follows. The first butterfly (BF) module allows the input to pass unchanged
into the shift register of length $2^{n-1}$ until it is full. At that point, the incoming
data  and  the  data  in  the  shift  register  are  combined  in  a  2-point  DFT: 


$$b_i = a_i + a_{i+N/2}, \qquad b_{i+N/2} = a_i - a_{i+N/2}, \qquad i = 0, 1, \ldots, \tfrac{N}{2}-1 \tag{3}$$

Figure 1: Despain Cascade

The $b_i$ are sent to the CORDIC rotator module which applies the proper rotations
(twiddle factors) while the $b_{i+N/2}$ are sent back into the shift register. When all
$N/2$ 2-point DFTs have been computed, the $b_{i+N/2}$ are allowed to pass out of the
shift register into the CORDIC rotator. The operation of the next butterfly
module is similar, except that each input datum is combined with one $N/4$ away.

The entire computation is completed after $n = \log_2 N$ stages, where one stage
consists of the butterfly module, CORDIC rotator, and shift register.
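The stage-by-stage behaviour described above can be mimicked in software. The following Python sketch is an assumption-laden model, not the hardware: it processes a whole block at once instead of streaming through shift registers, the twiddle rotation is done by an ordinary complex multiply rather than a CORDIC module, and all names are hypothetical. It does preserve the pairing distances N/2, N/4, ... and the bit-reversed output order.

```python
import cmath

def dft(x):
    """Direct DFT used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def cascade_fft(data):
    """Radix-2 decimation-in-frequency FFT, one loop pass per pipeline stage."""
    a = list(data)
    n_points = len(a)
    stages = n_points.bit_length() - 1
    assert 1 << stages == n_points, "length must be a power of two"
    span = n_points // 2                 # shift-register length of the current stage
    while span >= 1:
        for start in range(0, n_points, 2 * span):
            for i in range(span):
                p, q = a[start + i], a[start + i + span]
                a[start + i] = p + q                              # sum output
                twiddle = cmath.exp(-2j * cmath.pi * i / (2 * span))
                a[start + i + span] = (p - q) * twiddle           # rotated difference
        span //= 2
    return a                             # results are in bit-reversed order

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

block = [complex(i % 5, (2 * i) % 3) for i in range(16)]
reference = dft(block)
staged = cascade_fft(block)
cascade_error = max(abs(staged[j] - reference[bit_reverse(j, 4)]) for j in range(16))
```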

In [2], Despain points out that, for certain values of N, the rotations
involved can be realized with less hardware than that which is required for a full
CORDIC rotator. For example, the rotations involved in the computation of the
FFT for N = 16 can be performed with a set of $\pi/4$ and $\pi/8$ rotators. Rotations
by these angles are accomplished through the use of rational approximations
which can be chosen for ease of hardware implementation. An example will be
given in the discussion of the $\pi/8$ chip below.
3. MODULAR CONSTRUCTION OF DFT PROCESSORS


3.1. Background

VLSI  implementations  of  DFT  processors  can  be  communication-limited  due 
to  the  fact  that  the  number  of  pins  per  chip  is  fixed  at  about  100  to  200,  while 
the number of transistors per chip is very large, about $10^5$ in 1982, and is rising
rapidly.  This  leads  to  a  fundamental  bandwidth  limitation  which  necessitates 
the  development  of  algorithms  and  computational  structures  which  minimize 
the  amount  of  communication  between  chips. 

The pipeline structure of Despain, as discussed above, lends itself to a VLSI
implementation due to the ease of constructing dynamic registers necessary for
pipeline computations [9]. However, for many applications, the speed of the
computation needs to be higher than that which is attainable by the use of a
single pipeline, and parallel structures are required. The construction of combined
parallel and pipeline processors for computing the DFT is the subject of the
following, although many of the results are more generally applicable to any VLSI
implementation of a DFT processor which takes advantage of the inherent
parallelism. These structures are designed to minimize the amount of
communication necessary to compute the DFT and to make the communication hardware as
simple as possible.


3.2. The Radix-2 Cooley-Tukey Algorithm


3.2.1.  Pipeline  Structure 

In this section, we define $N = 2^n$ to be the transform size.

Referring  again  to  the  structure  in  figure  1,  an  obvious  partitioning  would 
be  to  include  as  many  stages  as  possible  on  each  chip,  since  this  structure  poses 
no communication problems. If the number of stages per chip is denoted by $l$,
we can see that the number of chips necessary to perform the computation is
given by

$$C = \lceil n/l \rceil .$$

Assuming that it takes $T_d$ seconds to transmit one datum through the pins of
each chip and that data can be input and output simultaneously, we can process
$F_t$ transforms per second where $F_t$ is given by

$$F_t = \frac{1}{T_d N}. \tag{3}$$


The latency of this structure (time between the input of the first data sample
and the output of the first result) is given by

$$T_l = (n-1)T_r + nT_B + N \tag{4}$$

where $T_r$ is the time required for a CORDIC rotation and $T_B$ is the time required
for the add operation. For large transforms, the $N$ term will dominate. Despain
notes that this structure is also capable of handling $2^j$ independent channels of
length $N 2^{-j}$ without modifying the configuration, although the results are
available after $n-j$ stages. Since the structures we will derive will be proven to be
functionally identical to the structure above, they will share this feature.
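As a rough numerical illustration of these figures of merit, the sketch below evaluates chip count, transform rate and latency for a single pipeline. It rests on stated assumptions: the chip count is taken as the n stages packed l to a chip (rounded up), the N term of the latency is interpreted as N datum periods, and the timing values in the example call are invented for illustration rather than taken from any measured module.

```python
import math

def pipeline_benchmarks(n, stages_per_chip, t_d, t_r, t_b):
    """Return (chips, transforms per second, latency in seconds) for N = 2^n."""
    n_points = 2 ** n
    # Assumption: n pipeline stages, at most `stages_per_chip` on one chip.
    chips = math.ceil(n / stages_per_chip)
    # One datum crosses the pins every t_d seconds, so a transform takes N * t_d.
    rate = 1.0 / (t_d * n_points)
    # Assumption: the N term of the latency is counted in datum periods t_d.
    latency = (n - 1) * t_r + n * t_b + n_points * t_d
    return chips, rate, latency

# Hypothetical timings: 100 ns per datum, 2 us per rotation, 200 ns per add.
chips, rate, latency = pipeline_benchmarks(
    n=10, stages_per_chip=4, t_d=100e-9, t_r=2e-6, t_b=200e-9)
```

With these made-up numbers the N term (about 102 microseconds) already outweighs the rotation and add terms combined, which is the sense in which it dominates for large transforms.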

It  should  be  noted  that  the  above  structure  is  inefficient  in  the  sense  that  it 
does  not  utilize  the  butterfly  module  while  passing  data  into  or  out  of  the  shift 
register.  The  structure  presented  by  Gold  and  Bially  [10]  avoids  this  and 
achieves  twice  the  transform  rate,  but  requires  twice  the  bandwidth.  Although 
the Gold and Bially structure will not be discussed specifically in the following,
the  results  will  be  applicable  to  it. 


3.2.2.  Notation 

Parker  [11]  has  recently  introduced  a  set  of  algebraic  tools  which  can  be 
used  to  describe  processor  networks  in  terms  of  their  patterns  of  connection. 
Although  his  notation  is  too  limited  to  be  used  directly  to  handle  parallel- 
pipeline  structures,  we  can  easily  extend  it  to  meet  our  needs.  The  operation  of 
one  stage  of  Despain's  structure  can  be  viewed  as  the  application  of  a  series  of 
operators  to  the  incoming  data  stream,  which  transform  the  data  from  one 
dimension  to  two  dimensions  and  which  perform  the  butterfly  operations.  If  the 
index of a datum is defined as its coordinates $[x,y]$ in the data stream, where
the original one-dimensional data stream (one row and $N$ columns) is led by a
datum with index $[0,0]$ and ends with a datum with index $[N-1,0]$, we can
define the operators as follows. The first operator breaks up each data stream
into several streams, and defines the operation of the shift register in each
stage:

$$\mu_{(j,k)}[x,y] = [[x_u \cdots x_{k+1}\, x_{k-j} \cdots x_1],\,[y_v \cdots y_1\, x_k \cdots x_{k-j+1}]] \tag{5}$$

$$\mu^{-1}_{(j,k)}[x,y] = [[x_u \cdots x_{k-j+1}\, y_j \cdots y_1\, x_{k-j} \cdots x_1],\,[y_v \cdots y_{j+1}]]$$

where $[x,y] = [[x_u \cdots x_1],[y_v \cdots y_1]]$ when $x$ and $y$ are described in binary
notation. This is well defined as long as $j \le k \le u$. Note that $2^j$ is the number of
rows each original row is broken into, and that the operator processes the input
data in chunks of $2^k$ columns. As an example of the operation of $\mu$, consider the
case $n=3$, $N=8$. The input data can be viewed as a one-dimensional array
coming in from the left as

$$[\, d_7\ d_6\ d_5\ d_4\ d_3\ d_2\ d_1\ d_0 \,].$$

Applying the operator $\mu_{(1,3)}$ converts this stream to a two dimensional array of
size $2 \times 2^{n-1}$

$$[\, d_7\ d_6\ d_5\ d_4\ d_3\ d_2\ d_1\ d_0 \,] \;\xrightarrow{\ \mu_{(1,3)}\ }\; \begin{bmatrix} b_3 & b_2 & b_1 & b_0 \\ b_7 & b_6 & b_5 & b_4 \end{bmatrix}.$$

It is easily seen that the two rows of this new array are the two streams which
are fed simultaneously into the first butterfly unit of the Despain cascade.
Applying the operator $\mu_{(1,2)}$ has a different effect. Since $2^k = 2^2 = 4$, the input
data is broken into 2 sets of 4 columns before it is transformed:

$$[\, d_7\ d_6\ d_5\ d_4\ d_3\ d_2\ d_1\ d_0 \,] \;\xrightarrow{\ \mu_{(1,2)}\ }\; \begin{bmatrix} b_5 & b_4 & b_1 & b_0 \\ b_7 & b_6 & b_3 & b_2 \end{bmatrix}.$$
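The stream-splitting operator of (5) is purely a rearrangement of index bits, so it can be checked with a few lines of code. The Python sketch below is an illustration under one reading of (5): bits $x_k \cdots x_{k-j+1}$ of the column index are moved to the low end of the row index. The names `mu` and `rows_after` are hypothetical.

```python
def mu(j, k, x, y):
    """Index map of the splitting operator: move bits x_k..x_(k-j+1) of x into y."""
    keep = x & ((1 << (k - j)) - 1)             # x_(k-j) .. x_1 stay in the column index
    moved = (x >> (k - j)) & ((1 << j) - 1)     # x_k .. x_(k-j+1) go to the row index
    high = x >> k                               # x_u .. x_(k+1) stay in the column index
    return (high << (k - j)) | keep, (y << j) | moved

def rows_after(op, n_points):
    """Apply an index map to a one-row stream d_0..d_(N-1).

    Returns the rows as lists of sample numbers, ordered by increasing column
    index (the text writes the same rows with the highest index on the left).
    """
    placed = {op(x, 0): x for x in range(n_points)}
    n_rows = 1 + max(y for _, y in placed)
    n_cols = 1 + max(x for x, _ in placed)
    return [[placed[(x, y)] for x in range(n_cols)] for y in range(n_rows)]

split_13 = rows_after(lambda x, y: mu(1, 3, x, y), 8)   # chunks of 8 columns
split_12 = rows_after(lambda x, y: mu(1, 2, x, y), 8)   # chunks of 4 columns
```

Here `split_13` pairs each sample with the one N/2 away, as the first butterfly stage requires, while `split_12` pairs samples that are two apart within each chunk of four.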


The butterfly operator, $B$, cannot be defined in terms of the indexes
of the data. It takes the rows of the two-dimensional data and combines
them in pairs, performing a 2-point DFT on each column in the pair, and places
the sum output in the row with the smaller $y$ index and the difference output in
the row with the larger $y$ index. For simplicity, we will include the twiddle
multiply in this operation. As an example, a structure to perform a $2^n$-point
transform can now be notated as

$$\mu_{(1,n)} B \mu^{-1}_{(1,n)} \cdots \mu_{(n,n)} B \mu^{-1}_{(n,n)} \tag{6}$$

where one should remember that the output is in bit-reversed order. We have
kept to Parker's convention and written the order of operation from left to
right, i.e. $\pi_1 \pi_2 [x,y] = (\pi_1 \circ \pi_2)([x,y]) = \pi_2(\pi_1[x,y])$. This is done so that the strings
of operators will match the structures exactly when they are diagrammed.

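Using this notation, we can define a decimation operator $\delta$ which converts
one row to many by breaking each row into columns.

$$\delta_{(k)}[x,y] = [[x_u \cdots x_{k+1}],\,[y_v \cdots y_1\, x_k \cdots x_1]] \tag{7}$$

In this case, each row is broken into $2^k$ rows and the operator is well defined if
$k \le u$. Using the same example as before, the operation of $\delta_{(1)}$ is

$$[\, d_7\ d_6\ d_5\ d_4\ d_3\ d_2\ d_1\ d_0 \,] \;\xrightarrow{\ \delta_{(1)}\ }\; \begin{bmatrix} b_6 & b_4 & b_2 & b_0 \\ b_7 & b_5 & b_3 & b_1 \end{bmatrix}.$$

The operation of $\delta_{(2)}$ is shown by

$$[\, d_7\ d_6\ d_5\ d_4\ d_3\ d_2\ d_1\ d_0 \,] \;\xrightarrow{\ \delta_{(2)}\ }\; \begin{bmatrix} b_4 & b_0 \\ b_5 & b_1 \\ b_6 & b_2 \\ b_7 & b_3 \end{bmatrix}.$$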

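We will soon see a need for an operator which transposes the two
dimensional data in subblocks of $2^k$ rows by $2^j$ columns.

$$T_{(k,j)}[x,y] = [[x_u \cdots x_{j+1}\, y_k \cdots y_1],\,[y_v \cdots y_{k+1}\, x_j \cdots x_1]] \tag{8}$$

As an example, take $n=4$ where the input stream has already been operated on
by $\delta_{(2)}$.

$$\begin{bmatrix} b_{12} & b_8 & b_4 & b_0 \\ b_{13} & b_9 & b_5 & b_1 \\ b_{14} & b_{10} & b_6 & b_2 \\ b_{15} & b_{11} & b_7 & b_3 \end{bmatrix} \;\xrightarrow{\ T_{(2,2)}\ }\; \begin{bmatrix} b_3 & b_2 & b_1 & b_0 \\ b_7 & b_6 & b_5 & b_4 \\ b_{11} & b_{10} & b_9 & b_8 \\ b_{15} & b_{14} & b_{13} & b_{12} \end{bmatrix}.$$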
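The decimation and transposition operators can be exercised the same way as index maps. The sketch below is a hedged Python illustration of the n = 4 example: decimation into 4 rows followed by a transposition of 4 x 4 blocks. It assumes that the first subscript of the transposition operator is the row exponent and the second the column exponent; the function names are hypothetical.

```python
def delta(k, x, y):
    """Decimation: the k low bits of the column index become low bits of the row index."""
    return x >> k, (y << k) | (x & ((1 << k) - 1))

def tau(k, j, x, y):
    """Transpose sub-blocks of 2^k rows by 2^j columns (swap y_k..y_1 with x_j..x_1)."""
    x_low = x & ((1 << j) - 1)
    y_low = y & ((1 << k) - 1)
    return ((x >> j) << k) | y_low, ((y >> k) << j) | x_low

def compose(*ops):
    """Apply index maps in left-to-right order, as in the operator strings of the text."""
    def run(x, y):
        for op in ops:
            x, y = op(x, y)
        return x, y
    return run

def layout(op, n_points):
    """Rows of sample numbers (increasing column index) after applying op to a 1-row stream."""
    placed = {op(x, 0): x for x in range(n_points)}
    n_rows = 1 + max(y for _, y in placed)
    n_cols = 1 + max(x for x, _ in placed)
    return [[placed[(x, y)] for x in range(n_cols)] for y in range(n_rows)]

decimated = layout(lambda x, y: delta(2, x, y), 16)
transposed = layout(compose(lambda x, y: delta(2, x, y),
                            lambda x, y: tau(2, 2, x, y)), 16)
```

After decimation each row holds every fourth sample; after the transposition each row holds four consecutive samples, which is why a direct implementation would need to buffer the whole block.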


With  these  definitions,  we  may  note  some  simple  identities: 
*  IMk+u)  i  +k*l 


( mj-ki-k)  j>k 

’■*“  ui' 

6(k)T(kj-k)  *  M(ihm)  (12) 

Proof  of  (9): 

•* *V ]  a  '  ’  ’  *l+i*l-k  '  '  1  xil*[lAf  1  ’  ’  Vi*i  '  '  '  ^-*4i]] 

*  [[*!»  ’  '  •  ♦!*»-{ *♦/)  ’  •  ■  xlMVv  •  •  •  Vl*l  •  •  •  *»-*+l*l-* 

•  •  • 

Proofs  for  the  other  identities  are  as  straightforward. 

The  usefulness  of  these  operators  and  their  realization  in  hardware  is  the 
next  topic. 


•  •  •  Vi*i  •  •  •  ^-*+i]] 
Vi*»  *  ‘  ‘ 


3.2.3. Derivation of Parallel-Pipeline Structures

We  make  the  first  attempt  at  parallelizing  Despain’s  structure  by  applying 
the  Gold  and  Bially  procedure  [10]. 

Theorem 1:

$$\mu_{(1,n)} B \mu^{-1}_{(1,n)} \cdots \mu_{(n,n)} B \mu^{-1}_{(n,n)} = \delta_{(k)}\, \mu_{(1,n-k)} B \mu^{-1}_{(1,n-k)} \cdots \mu_{(n-k,n-k)} B \mu^{-1}_{(n-k,n-k)}\; T_{(k,n-k)}\; \mu_{(1,k)} B \mu^{-1}_{(1,k)} \cdots \mu_{(k,k)} B \mu^{-1}_{(k,k)}\, \mu^{-1}_{(n-k,n)} \tag{13}$$

The $\delta_{(k)}$ merely breaks the data into $2^k$ rows of length $2^{n-k}$, each of which is
transformed. The rows and columns are transposed by the $T_{(k,n-k)}$ operator,
and the rows of the transposed matrix are then transformed. The final $\mu^{-1}_{(n-k,n)}$
rearranges the data back to one dimension.

Proof of Theorem 1: The identity in (12) implies that $\delta_{(k)} T_{(k,n-k)} \mu^{-1}_{(n-k,n)}$ is
an identity transformation. Therefore, we can rewrite the left hand side of (13)
as


P{l.n)B  P{\)n)  ‘  -  '  P{n-*a)Bpin-*a'fi(*)T{k.n-k)Pin-*a)  (14) 

%)Bp{n-*+i.n)  '  '  '  Pin.n)Bpina)- 

Note  the  identity 

AKiJtl^PoU^C*)  =  &{*)P<J»-k)Bpu\n-k)  j*n-k.k*n.  (15) 

Using  this  to  propagate  the  i(*j  to  the  left,  we  can  rewrite  (14)  as 

6(*)P(l.n-k)Bpfl!n-k)  ■  ■  ■  P(n-ka-*)Bp(n1-k.n-k)7’(*.n-k)Pin,-*a)  (16) 

Pin-**l.n)Bpin-*+la)  '  '  Pin.n)Bpin.n)- 

Using  the  identity  in  (11),  we  have 

&<*)Pi\.n-*)B H{in -* )  •  •  •  Pin-*.n-*)Bpin-*a-*)T{*.n-*)  (IT) 

Pi\M)BPiij*-i)BPi\*-Z)  •  •  BfJii  DB^n) 

Again  using  (11),  we  see  that 

Pii.*-i)  =  P<jlJ>)Pij+i*)  J+l«*  (18) 

which  allows  us  to  rewrite  (17)  as 

&(*)Pil.n-*)Bpi\}n-*)  '  '  Pin-*,n-*)Bp(n-*,n-k)T{*a-*)  (18) 

PUJt)Bpiilj,)Pizjt)Bf^ljt)  ■  ■  •  Pi*.*)Bpi^a)- 

One  final  use  of  (11)  gives  us  the  expression 

&{*)Pil.n-*)B p(in-*)  ■  •  •  Pin-*.n-*)Bp(n-*a-k)T^*.n-*)  (2°) 

PilM)Bpi\ ]*)  ■  •  ■  Pi*M)Bpi?*)P(n-*.n) 


which  completes  the  proof.  Note  that  this  theorem,  although  a  good  first  step 
toward  the  parallelization  of  (6),  does  not  give  us  a  very  good  structure,  since 
the  transposition  operation  in  the  middle  would  require  a  buffer  of  size  N  to 
implement  directly. 

The next step is to break down $T_{(k,k)}$ into some operators which are easier to
realize in hardware. Let us define the following row interchange operator

$$R_{(k)}[x,y] = [x,\,[y_v \cdots y_2\,(y_1 \oplus y_{k+1})]] \tag{21}$$

where "⊕" means boolean "exclusive or". This operator leaves the first $2^k$ rows
of the data unchanged, and interchanges the next $2^k$ rows in pairs. As an
example,

$$\begin{bmatrix} b_4 & b_0 \\ b_5 & b_1 \\ b_6 & b_2 \\ b_7 & b_3 \end{bmatrix} \;\xrightarrow{\ R_{(1)}\ }\; \begin{bmatrix} b_4 & b_0 \\ b_5 & b_1 \\ b_7 & b_3 \\ b_6 & b_2 \end{bmatrix}.$$

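We also define a commutator operator

$$\varphi_{(k,l)}[x,y] = [x,\,[y_v \cdots y_{k+l+1}\,(y_{k+l} \oplus x_k)\, y_{k+l-1} \cdots y_1]] \tag{22}$$

$$\varphi_{(k)}[x,y] = \varphi_{(k,0)}[x,y] = [x,\,[y_v \cdots y_{k+1}\,(y_k \oplus x_k)\, y_{k-1} \cdots y_1]].$$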

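As examples,

$$\begin{bmatrix} b_{12} & b_8 & b_4 & b_0 \\ b_{13} & b_9 & b_5 & b_1 \\ b_{14} & b_{10} & b_6 & b_2 \\ b_{15} & b_{11} & b_7 & b_3 \end{bmatrix} \;\xrightarrow{\ \varphi_{(1)}\ }\; \begin{bmatrix} b_{13} & b_8 & b_5 & b_0 \\ b_{12} & b_9 & b_4 & b_1 \\ b_{15} & b_{10} & b_7 & b_2 \\ b_{14} & b_{11} & b_6 & b_3 \end{bmatrix}$$

$$\begin{bmatrix} b_{12} & b_8 & b_4 & b_0 \\ b_{13} & b_9 & b_5 & b_1 \\ b_{14} & b_{10} & b_6 & b_2 \\ b_{15} & b_{11} & b_7 & b_3 \end{bmatrix} \;\xrightarrow{\ \varphi_{(2)}\ }\; \begin{bmatrix} b_{14} & b_{10} & b_4 & b_0 \\ b_{15} & b_{11} & b_5 & b_1 \\ b_{12} & b_8 & b_6 & b_2 \\ b_{13} & b_9 & b_7 & b_3 \end{bmatrix}.$$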


Since  this  operator  is  just  a  permutation  within  each  column  of  the  data,  it  can 
be  implemented  with  a  commutator  (switch  or  multiplexer).  In  fact,  it  should  be 
noted  that  ?(»).... 9»(a-i)  requires  the  use  of  2* + '-pole  switches. 
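Because the row interchange and commutator operators only permute entries within a column, their effect is easy to tabulate. The Python sketch below is an illustration under the reading that the row interchange XORs row bit $y_1$ with $y_{k+1}$ and the commutator XORs row bit $y_{k+l}$ with column bit $x_k$. The function names and the 4 x 4 test grid are assumptions of the sketch.

```python
def rho(k, x, y):
    """Row interchange: flip y_1 whenever y_(k+1) is set."""
    return x, y ^ ((y >> k) & 1)

def phi(k, l, x, y):
    """Commutator: flip row bit y_(k+l) whenever column bit x_k is set."""
    return x, y ^ (((x >> (k - 1)) & 1) << (k + l - 1))

def permute(op, rows):
    """Move entry rows[y][x] to position op(x, y); the shape is preserved."""
    out = [[None] * len(rows[0]) for _ in rows]
    for y, row in enumerate(rows):
        for x, value in enumerate(row):
            nx, ny = op(x, y)
            out[ny][nx] = value
    return out

# rows[y][x] holds sample number 4*x + y: 16 samples decimated into 4 rows.
grid = [[4 * x + y for x in range(4)] for y in range(4)]

after_rho1 = permute(lambda x, y: rho(1, x, y), grid)
after_phi1 = permute(lambda x, y: phi(1, 0, x, y), grid)
after_phi2 = permute(lambda x, y: phi(2, 0, x, y), grid)
```

Every entry keeps its column, which is why a switch between rows (a commutator) is enough to realize these operators and no buffering along the stream is needed.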

Now  we  are  in  a  position  to  prove  the  following  lemma. 

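Lemma 1:

$$T_{(k,k)} = \mu_{(1,k)} R_{(k)} \mu^{-1}_{(1,k)} \cdots \mu_{(k,k)} R_{(k)} \mu^{-1}_{(k,k)}\; \varphi_{(k)} \cdots \varphi_{(1)}\; \mu_{(1,k)} R_{(k)} \mu^{-1}_{(1,k)} \cdots \mu_{(k,k)} R_{(k)} \mu^{-1}_{(k,k)} \tag{23}$$

Proof: Proof is by induction on $k$. For $k = 1$, it is easily verified that

$$T_{(1,1)} = \mu_{(1,1)} R_{(1)} \mu^{-1}_{(1,1)}\, \varphi_{(1)}\, \mu_{(1,1)} R_{(1)} \mu^{-1}_{(1,1)}. \tag{24}$$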

Assume  that  the  lemma  holds  for  k.  Note  that 


H,lJt+\YRVt+l)fHik+i)V[k+\)P{iJt+i)R(k+i)M{iJn-i)T<kA)lx<v]  (25) 

*  (fcjg)[[*»,  ■  •  •  *k*  1  (*S4i®y*«-i)  *k  •  •  •  *i].y] 
[Vv  •  Vk+e*k+iVk  Vi]] 

s  •  •  •  *fc+aV*+i**  •  •  *i].[yv  •  •  ■  Vk+&k+iVk  *  ■  ■  Vill 

s  [[*h  '  '  •  **>&,+ 1  •  •  •  ViMVv  •  •  •  Vu+9*k+\  •  •  ’  *1]] 

*  r(*4,>+I)[*.v]. 

Using  the  identities 

~  <P[l)MiJj»R(k)P{j}k)  j*k,l*k-j+ 1  (28) 

Mu*)R(fi)PVl*)tM.t.k)R(k)P{ilk)  =  PKi*)R{k)PC&)PHj.k)R{kWj'k)  I*;.  (27) 

we  see  that  the  left  hand  side  of  (25)  is  the  same  as  the  right  hand  side  of  (23) 
which  completes  the  proof. 
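Lemma 1 can also be checked exhaustively for small k by treating every operator as a map on index bits. The Python sketch below does this under the same readings of the operator definitions as above (splitting, row interchange, commutator and square transposition). Seen this way the lemma is the familiar three-step exclusive-or swap applied bit by bit to the low k bits of x and y. All function names are hypothetical.

```python
def mu(j, k, x, y):
    """Move bits x_k..x_(k-j+1) of the column index into the low bits of the row index."""
    keep = x & ((1 << (k - j)) - 1)
    moved = (x >> (k - j)) & ((1 << j) - 1)
    return ((x >> k) << (k - j)) | keep, (y << j) | moved

def mu_inv(j, k, x, y):
    """Inverse of mu: put the j low row bits back into the column index."""
    keep = x & ((1 << (k - j)) - 1)
    high = x >> (k - j)
    back = y & ((1 << j) - 1)
    return (((high << j) | back) << (k - j)) | keep, y >> j

def rho(k, x, y):
    """Row interchange: y_1 becomes y_1 XOR y_(k+1)."""
    return x, y ^ ((y >> k) & 1)

def phi(k, x, y):
    """Commutator: y_k becomes y_k XOR x_k."""
    return x, y ^ (((x >> (k - 1)) & 1) << (k - 1))

def transpose(k, x, y):
    """Square transposition: swap the k low bits of x and y."""
    mask = (1 << k) - 1
    return (x & ~mask) | (y & mask), (y & ~mask) | (x & mask)

def lemma_rhs(k, x, y):
    """Right hand side of the lemma, applied left to right."""
    def interchange_pass(x, y):
        for j in range(1, k + 1):
            x, y = mu_inv(j, k, *rho(k, *mu(j, k, x, y)))
        return x, y
    x, y = interchange_pass(x, y)
    for m in range(k, 0, -1):
        x, y = phi(m, x, y)
    return interchange_pass(x, y)

lemma_holds = all(lemma_rhs(k, x, y) == transpose(k, x, y)
                  for k in (1, 2, 3)
                  for x in range(1 << (k + 1))
                  for y in range(1 << (k + 1)))
```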
We now look at the structure of Theorem 1 for three cases.

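Case 1: $n-k = k$. For this case, the right hand side of (13) reduces to

$$\delta_{(k)}\, \mu_{(1,k)} B \mu^{-1}_{(1,k)} \cdots \mu_{(k,k)} B \mu^{-1}_{(k,k)}\; T_{(k,k)}\; \mu_{(1,k)} B \mu^{-1}_{(1,k)} \cdots \mu_{(k,k)} B \mu^{-1}_{(k,k)}\, \mu^{-1}_{(k,2k)}. \tag{28}$$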

Substituting  in  for  T(*,*)  from  Lemma  1  and  using  the  identities 

_  .  f  l+j 

mjM&HjAMivRwvlk)  =  {  l*>  (29) 

and 

~  /*UM)B(^)B/J{j\)  (30) 

leads to the expression

$$\delta_{(k)}\, \mu_{(1,k)} B R_{(k)} \mu^{-1}_{(1,k)} \cdots \mu_{(k,k)} B R_{(k)} \mu^{-1}_{(k,k)}\; \varphi_{(k)} \cdots \varphi_{(1)}\; \mu_{(1,k)} R_{(k)} B \mu^{-1}_{(1,k)} \cdots \mu_{(k,k)} R_{(k)} B \mu^{-1}_{(k,k)}\, \mu^{-1}_{(k,2k)} \tag{31}$$

for the case $n = 2k$.

We now have a structure which is easily realizable, since $\mu_{(j,k)} B R_{(k)} \mu^{-1}_{(j,k)}$ can
be performed by a set of butterfly modules, half of which reverse the order in
which they output the butterfly results. The $\mu_{(j,k)} R_{(k)} B \mu^{-1}_{(j,k)}$ are also
implemented by a column of butterfly modules, although half of these reverse the
order in which they accept their inputs. As we have noted previously, the
$\varphi_{(k)} \cdots \varphi_{(1)}$ can be implemented with a $2^k$-pole commutator. An example of this
structure will be demonstrated below.

Case 2: $n-k > k$. For this case, we note that the transposition operator
in (13) is difficult to implement since it changes the number of data
streams (rows) from $2^k$ up to $2^{n-k}$. However, it is easy to show that

T’l*  j»-*)  =  -*.»)  k<n-k.  (32) 

Substituting  this  into  (13),  expanding  T^j,)  by  Lemma  1  and  propagating  the  m’s 
gives  us  the  structure 

(33) 

'  •  M(n-*.n-*)BBfr)M(nl-k *-*)?(*)  ’  ■  ‘ 

which  is  as  easy  to  implement  as  (31). 

Case 3: $n-k < k$. In this case, the application of the transposition operator
$T_{(k,n-k)}$ in (13) results in fewer data streams (rows). Consider the portion of
(13) which comes after this operator. Since it is of the same form as the left
hand side of (13), we can apply Theorem 1 again and rewrite the right hand side
of (13) as

tH. *-*»] 

where  the  effect  of  Theorem  1  has  been  bracketed  for  clarity.  The  combination 
of  operators  leaves  the  number  of  rows  constant  and  can  be 

implemented  with  the  help  of  the  following  lemma. 
Lemma 2:

i!<)  ‘  ■  •  l«jj)R{k)fMjl.j)'Pv*-j)  ■  ■  (35) 

‘  '  ‘  k>j. 

The proof of this is similar to that of Lemma 1 and is omitted. If we use this
lemma and the identities (29) and (30) to expand (34), we arrive at

•  •  ■  AK«-fcj»-*)^(n-*)M(n‘--*4»-*)i»(»-fcifc-n)  (36) 

P(1.8» -*0/^(1. «-*)■/? '  '  (Hn -kji-k)R(n-k)BfM.n-k.n-k) 

^(2*-».»-*)AKl.2fc-n)5/i{l)afc-n)  •  •  •  MiU-nJUt-n)RM0At-n.2k-n) 

M{n m  .» )  • 

The first half of this structure is now easily realizable, and the $T_{(2k-n,n-k)}$ will be
expanded in one of the three ways we have just looked at. If necessary, case 3
should be applied recursively to the structure until the last $T$ operator can be
expanded as in case 1 or 2.

3.2.4.  Partitions  of  Derived  Structures 

Clearly, one should again try to partition the structure into chips by
including on each chip as many stages of each pipeline as possible. We will again
denote the number of butterfly units per chip by $l$, and assume that it is
constant, although the following is easily extended to those cases where it is not.
From the results of the previous section, one can see that communication
between the pipelines is required no later than after the first $n-k$ stages, which
implies that $l \le n-k$. If $l \mid n-k$ and $l \mid \langle n \rangle_{n-k}$ (where $\langle\ \rangle_m$ denotes the residue
mod $m$ and $\mid$ is read "divides"), the structures of the previous section can be
used directly. However, if these requirements are not met, one can use the
identities

(H (37) 
=  9»{»)P(/jk)7?(*)M{jlfc)  l*k-j+l 

and 

(38) 

*  Vlrji-kWUJifluWj1*)  ri*y-f  +  l 

to move the $\varphi$'s through the structure so that there is always a multiple of $l$
stages between them. An example of this will be shown below.

These identities may also be used to derive structures when $l$ is not a
constant, as long as no $l$ violates the inequality $l \le n-k$.

3.2.5.  Examples  of  Parallel-Pipeline  Structures 

For the following examples, we choose $n=6$ ($N=64$), and derive structures
for various numbers of data streams ($2^k$) and number of butterfly modules per
chip ($l$).

Example 1: $k = 3$, $l = 3$. This corresponds directly to case 1 and is notated by

$$\delta_{(3)}\, \mu_{(1,3)} B R_{(3)} \mu^{-1}_{(1,3)} \cdots \mu_{(3,3)} B R_{(3)} \mu^{-1}_{(3,3)}\; \varphi_{(3)} \cdots \varphi_{(1)}\; \mu_{(1,3)} R_{(3)} B \mu^{-1}_{(1,3)} \cdots \mu_{(3,3)} R_{(3)} B \mu^{-1}_{(3,3)}\, \mu^{-1}_{(3,6)} \tag{39}$$

where the commutator consists of one 8-pole switch. This example is
diagrammed in figure 2, where the reversal of the butterfly inputs or outputs is
shown by an "R" placed at the input or output of the butterfly module,
respectively. The commutator required is shown in figure 3, where all the
internal switches start in the position shown, then switch at the rates shown in the
figure.

Figure 3: Commutator for Example 1

Figure 4: Decimator for Example 1 (Note: "X" signifies high speed register)

The design for the decimator shown in figure 4 assumes that the data arrives in
a single stream (from an A/D, for example), and requires the use of a few fast
registers. The $\mu^{-1}$ at the output would in general not be implemented, since its
implementation would require logic which was $2^k$ times as fast as the logic in the
FFT processor.

Example 2: $k = 3$, $l = 2$. Here, we modify (39) using the identity in (37) to

where  we  now  have  two  commutators,  the  first  consisting  of  2  4-pole  switches 
and  the  second  consisting  of  1  2-pole  switch.  This  configuration  is  diagrammed 
in  figure  5.  In  this  case,  the  first  commutator  consists  of  the  first  two  sections 
and  the  second  commutator  consists  of  the  third  section  of  the  commutator 
shown  in  the  last  example.  Note  that  we  could  just  as  easily  have  moved  pm  to 
the  right  and  combined  it  with  p^)  with  no  change  in  the  operation  of  the  other 
parts  of  the  circuit. 

Example 3: $k = 4$, $l = 2$. The first transposition operator can be expanded as
in case 3, and the resulting transposition operator can be expanded as in case 1.
The resulting structure is

(8)M(i!&)9*(M>9»Ci.a)  (41) 

i*U.r>R  &BR  qt)8R  (*>^a!g)p<s)p(i) 

fH  i  BuC&WW1*  W^MdV^V^e)- 


Figure 5: Example 2 ($k = 3$, $l = 2$). Note: shift register memories in first row are
representative of entire column.


3.2.6. Benchmarks of the Derived Structures

If $N = 2^n$ is the transform size, $2^k$ the number of data streams, and $l$ the
number of butterfly modules per chip, the number of chips required to compute
the radix-2 FFT is

$$C = 2^k \lceil n/l \rceil \tag{42}$$

and we can process

$$F_t = \frac{2^k}{T_d N} \tag{43}$$

transforms per second. It should be noted again that this structure, like the
pipeline it was derived from, can process intermixed channels of length $N 2^{-j}$.

For a desired transform rate, the number of chips required can be computed
by eliminating $k$ from the above equations and noting that $k$ must be an
integer. Thus,


Since $l \le n-k$, we can derive a lower bound for $C$ given by


(44) 


This is of interest, since it states that the dependence of the lower bound of $C$ on
$F_t$ is worse than linear. Thus, although the speedup is optimal in the sense that,
to raise $F_t$ by a factor of $2^k$, one multiplies the number of butterfly modules by
the same factor, the speedup is not optimal if one counts chips. Figure 6 has a
graph of $C$ versus $F_t T_d N$ for the case $N = 1024$, where it is also assumed that $l$ is
limited to $\le 4$ due to area limitations on each chip.
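The worse-than-linear growth in chip count can be tabulated for the N = 1024 case. The Python sketch below is a hedged illustration that assumes a chip count of $2^k \lceil n/l \rceil$ with $l$ chosen as large as the constraints $l \le n-k$ and $l \le 4$ allow, and a normalized rate $F_t T_d N = 2^k$. These formulas are assumptions of the sketch, and the function name is hypothetical.

```python
import math

def chips_for(n, k, max_units_per_chip):
    """Chip count for 2^k data streams, packing as many units per chip as allowed."""
    units_per_chip = min(max_units_per_chip, n - k)    # constraint l <= n - k
    return (2 ** k) * math.ceil(n / units_per_chip)    # assumed count: 2^k * ceil(n / l)

n = 10   # N = 1024
# Each entry is (normalized transform rate 2^k, number of chips).
table = [(2 ** k, chips_for(n, k, 4)) for k in range(n)]
# Chips needed per unit of normalized rate: constant at first, then rising.
chips_per_rate = [c // rate for rate, c in table]
```

Up to 64 streams the cost is 3 chips per unit of rate; beyond that the limit $l \le n-k$ forces fewer stages per chip and the ratio climbs to 10 at the maximum rate.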


Figure 6: Cost in number of chips

Since $k \le n-1$, the highest transform rate possible is given by

$$F_t(\max) = \frac{1}{2T_d} \tag{46}$$

at a cost of

$$C = \frac{Nn}{2} \tag{47}$$

chips. The latency is given by

$$T_l = (n-1)T_r + nT_B + \frac{N}{2^k}. \tag{48}$$

For large transforms, the latency is reduced by $2^k$ over the original pipeline
structure.

3.3. The Radix-r Cooley-Tukey Algorithm
3.3.1.  Pipeline  Structure 

The pipeline structure of Despain extends easily to arbitrary radix by
including in each stage a computational unit capable of performing an $r$-point
DFT and a CORDIC rotator module. An example of a radix-4 FFT processor of this
form can be found above and in [1]. If each chip contains $l$ of these stages, and
if the transform size is now given by $N = r^n$, the number of chips needed to
perform the FFT is given by

$$C = \lceil n/l \rceil \tag{49}$$

and the transform rate is again

$$F_t = \frac{1}{T_d N}. \tag{50}$$


3.3.2.  Notation 

The notation for the radix-2 case can be extended in a straightforward
manner to cover the radix-r case. The index of the data is now expressed in
terms of its base-r representation as $[x,y] = [[x_u \cdots x_1],[y_v \cdots y_1]]$. The
definitions of $\mu$, $\delta$ and $T$ remain unchanged, but the definitions of $R_{(k)}$ and
$\varphi_{(k,l)}$ must be generalized to

$$R_{(k)}[x,y] = [x,\,[y_v \cdots y_2\, \langle y_1 + y_{k+1} \rangle_r]] \tag{51}$$

$$R^{-1}_{(k)}[x,y] = [x,\,[y_v \cdots y_2\, \langle y_1 - y_{k+1} \rangle_r]]$$

and

$$\varphi_{(k,l)}[x,y] = [x,\,[y_v \cdots y_{k+l+1}\, \langle x_k - y_{k+l} \rangle_r\, y_{k+l-1} \cdots y_1]] \tag{52}$$

$$\varphi_{(k,0)} = \varphi_{(k)}.$$

Note that the $\varphi$ operator can still be implemented as a commutator. In this
more general case, Despain's structure is notated in the same way as for
radix-2, with the $B$ operator now interpreted as a radix-r DFT. Theorem 1 still holds,
and Lemmas 1 and 2 are easily generalized as follows.

Lemma  1 ; 

T(kjc)  =  •  •  •  PHfc '*k)tLj?jc)V{k)  ■  ■  ■  V{\)  (53) 

■  '  '  tJik.k)ft(k)y-Ck.k)- 

Lemma  2: 

T(kjAk-j)  =  lH\J)R{k)tMxi)  •  •  ■  (5*) 

fHi.J)RCk)fJii]j)  ■  ■  fHj.i)R[k)IMjj)  k>3- 

The  structures  for  the  different  cases  are  identical  to  the  structures  derived 
before,  except  that  the  Rfa  operators  will  be  propagated  to  the  left  and  the  Rfa 
operators  will  be  propagated  to  the  right.  For  example,  the  n-k  -  k  case  would 
generalize to

&{k)»\i.k)BR{k)ti{\x)  ■  ■  ■  Vikj')BRfc)[Jiklk)<P(k)  ■  ■  ■  <P(i)  (55) 

miMftwButfjt)  ■  ■  •  mk*)R (k)B Wi1* ) ■ 

This  is  again  easily  realizable,  although  the  data  at  the  input  and  output  to  the 
radix-r  DFT  modules  will  need  to  be  rearranged  in  a  more  complicated  pattern 
(although  there  will  still  be  no  buffering  required.)  Structures  for  the  other 
cases  are  generalized  similarly. 

3.3.3.  Benchmark  of  the  Radix-r  Structures 

If the number of data streams is given by $r^k$, and if l is the number of
computational units per chip, the number of chips is given by


-  ZL  . 

7] 


and the transform rate by

Following the same argument as before, we can arrive at a lower bound for C for
a given $F_t$.

Cfc  - - Ff  T^N.  (58) 


The  highest  possible  transform  rate  is  given  by 
Ft(max)  =  — - 

at  a  cost  of 
Nn 


chips.  The  latency  becomes 

Ti  =  (n-l)7>+«7fl+-“-  (61) 

T 

where $T_B$ is now the time it takes to do a radix-r DFT and $T_r$ is the time required
for a CORDIC rotation.

3.4.  The  Use  of  the  Winograd  Algorithm  in  Pipeline  Processors 

3.4.1.  Background 

Certain  DFT  sizes  are  easier  to  implement  than  others.  Although  in  the 
past,  powers  of  2  have  been  a  common  choice  due  to  the  Cooley-Tukey  algorithm 
[4],  there  is  often  a  large  advantage  to  employing  other  sizes  as  will  be  seen 
below. In fact, it is possible to reduce the number of multiplies to O(N) for
certain values of N. The algorithms which achieve this are based on the reduction
of  DFTs  to  convolutions  by  Rader  [12],  the  Good  prime  factor  algorithm  [5],  and 
the  combining  of  multipliers  due  to  Winograd  [8]. 

We have already seen that the important idea in all the FFT algorithms is to
factor the DFT and thereby reduce the number of operations over its direct
calculation. This corresponds to a factorization of N, the transform size. Since we
are interested mainly in pipeline organizations of FFT processors, each of the
factors of N will correspond to a pipeline module.

3.4.2.  Module  Implementations 

3.4.2.1. Base 2^n Modules

We  have  already  discussed  these  modules  in  detail  and  there  is  quite  a  bit  in 
the  open  literature  about  them.  The  pipeline  processors  of  Despain  as  described 
in  [1]  and  [2]  would  be  of  the  most  interest  here. 

3.4.2.2. Base 3 Modules

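The derivation of the base 3 module begins with the DFT for N = 3,

$$A_r = \sum_{k=0}^{2} B_k W^{rk}, \qquad r = 0,1,2 \qquad (1)$$

where $W$ is a primitive cube root of unity; the signs in the factored form
below correspond to $W = e^{j2\pi/3}$.

Define

$$a_1 = B_1 + B_2$$
$$a_0 = a_1 + B_0$$
$$a_2 = a_1 - 2B_2.$$

Equation (1) can now be factored into

$$A_0 = a_0$$
$$A_1 = a_0 - \frac{3a_1}{2} + \frac{a_2\sqrt{3}\,j}{2}$$
$$A_2 = A_1 - a_2\sqrt{3}\,j.$$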

The signal flow graph of this transform is shown in figure 7.

Figure 7: Signal Flow Graph of Base 3 DFT

It can now be observed that this calculation requires 7 real additions, one
multiplication by √3, and several shift operations.

The term √3 can be approximated by a ratio of simple integers as in [2].
The result is that only about 4 complex additions per data point are required for
the base 3 transform.
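The factored form of the 3-point DFT can be checked numerically against the direct definition. The sketch below is only an illustration: it assumes the kernel $W = e^{j2\pi/3}$ (with the opposite sign convention the two imaginary terms change sign), it uses a floating-point √3 rather than the rational approximation of [2], and the function names are hypothetical.

```python
import cmath
import math

def dft3_direct(B, W):
    """Direct evaluation of A_r = sum_k B_k * W**(r*k) for N = 3."""
    return [sum(B[k] * W ** (r * k) for k in range(3)) for r in range(3)]

def dft3_factored(B):
    """Factored 3-point DFT using the intermediate terms a0, a1, a2."""
    B0, B1, B2 = B
    a1 = B1 + B2
    a0 = a1 + B0
    a2 = a1 - 2 * B2                      # equals B1 - B2
    A0 = a0
    # Assumed sign convention: W = exp(+j*2*pi/3).
    A1 = a0 - 1.5 * a1 + 1j * (math.sqrt(3) / 2) * a2
    A2 = A1 - 1j * math.sqrt(3) * a2
    return [A0, A1, A2]

if __name__ == "__main__":
    B = [1 + 2j, -3 + 0.5j, 0.25 - 1j]
    W = cmath.exp(2j * cmath.pi / 3)
    for direct, factored in zip(dft3_direct(B, W), dft3_factored(B)):
        print(direct, factored)
```

Apart from the single √3 scaling, every operation in the factored form is an addition, a subtraction, or a multiplication by a small power of two, which is what makes the module cheap in arithmetic.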

If  the  algorithm  of  figure  7  is  to  be  realized  in  pipeline  form,  considerable 
data  reordering  is  required.  Thus,  we  will  use  a  slight  modification  of  the  signal 
flow  graph.  This  is  illustrated  in  the  circuit  diagrams  below.  Figure  8  shows  the 
overall  base  3  circuit.  Figures  9  and  10  show  the  add/subtract  portions  of  the 
base  3  circuit,  while  the  multiplication  module  could  be  implemented  as  either  a 
full multiplier circuit or as a rational approximation as shown in figures 11, 12,
and 13. If the base 3 module is not the last module in a cascade, shift registers
would be necessary to multiplex the data as was done for the base 2^n modules
previously derived.

From  these  figures  it  can  be  seen  that,  while  the  number  of  arithmetic 
operations is small, the complexity of rearranging the data is large. In
particular, it would be costly to employ a base 3 module at the front of a large
FFT pipeline due to the large shift register memory which would be required
relative to the base 2^n modules.

A similar analysis of other prime factors such as 5, 7, and 11 indicates that
the complexity of rearranging the data grows very quickly with the value of the
prime. Because it is severe for the case N = 3, by the time the case N = 5 is
considered, the complexity negates many of the advantages of the prime factor
technique, especially for the pipeline processors considered here. The problem
is not so great for specialized, single random access memory processors.


Figure 9: First Half of Base 3 Butterfly (a: Circuit Diagram; b: Macro Symbol)

Figure 10: Last Half of Base 3 Butterfly

3.4.2.3. Base 5 Module

There are several approaches to deriving a base 5 algorithm suitable for
pipeline cascade processors. If the algorithms outlined by Winograd [8] and
developed in detail by Kolba and Parks [13] are to be employed, then the flow
graph of figure 14 results. Although this form of the algorithm could be
employed with the use of input and output buffers and buffers for temporary
storage, it is better to derive an algorithm that is inherently in the pipeline
form.

Figure 11: Base 3 Specialized Multiplier Circuit (a: Block Diagram; b: Macro Diagram)

Figure 12: Divide by 2 Circuit (a: Circuit Diagram; b: Macro Symbol)

This algorithm will be derived to meet the following constraints:

1. pipeline organization

2. Winograd form (central multipliers)

3. minimized multiplies

4. minimized memory requirements.

Figure 14: Base 5 Signal Flow Graph

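The first step is due to Rader [12]. As an example, we will look at N = 5. We
can write the DFT in matrix form as

$$\begin{bmatrix} A_0 \\ A_1 \\ A_2 \\ A_3 \\ A_4 \end{bmatrix} =
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 \\
1 & W^1 & W^2 & W^3 & W^4 \\
1 & W^2 & W^4 & W^1 & W^3 \\
1 & W^3 & W^1 & W^4 & W^2 \\
1 & W^4 & W^3 & W^2 & W^1
\end{bmatrix}
\begin{bmatrix} B_0 \\ B_1 \\ B_2 \\ B_3 \\ B_4 \end{bmatrix}$$

where $W$ is a primitive fifth root of unity.

The difficult part of this calculation is

$$\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix} =
\begin{bmatrix}
W^1 & W^2 & W^3 & W^4 \\
W^2 & W^4 & W^1 & W^3 \\
W^3 & W^1 & W^4 & W^2 \\
W^4 & W^3 & W^2 & W^1
\end{bmatrix}
\begin{bmatrix} B_1 \\ B_2 \\ B_3 \\ B_4 \end{bmatrix} \qquad (1)$$

since

$$A_0 = B_0 + B_1 + B_2 + B_3 + B_4$$

and

$$A_1 = B_0 + a_1$$
$$A_2 = B_0 + a_2$$
$$A_3 = B_0 + a_3$$
$$A_4 = B_0 + a_4.$$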

Since 5 is prime, we are guaranteed [14] that we can find a primitive root, g,
such that $g^k \bmod 5$ for k = 0,1,2,... forms the set {1,2,3,4}, which is the set of all
the positive integers less than 5. This primitive root defines a permutation of
the set {1,2,3,4} by applying the function $g^k \bmod 5$ to the numbers {0,1,2,3}, which
gives us {1,2,4,3} for g = 2. If we apply this permutation to the computation in (1),
switching the last two rows and columns of the matrix, we end up with


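$$\begin{bmatrix} a_1 \\ a_2 \\ a_4 \\ a_3 \end{bmatrix} =
\begin{bmatrix}
W^1 & W^2 & W^4 & W^3 \\
W^2 & W^4 & W^3 & W^1 \\
W^4 & W^3 & W^1 & W^2 \\
W^3 & W^1 & W^2 & W^4
\end{bmatrix}
\begin{bmatrix} B_1 \\ B_2 \\ B_4 \\ B_3 \end{bmatrix}$$

This can be recognized as the cyclic correlation of P and B where

$$P = [W^1 \; W^2 \; W^4 \; W^3]^T$$
$$B = [B_1 \; B_2 \; B_4 \; B_3]^T.$$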


It is well known that convolutions and correlations can be performed with the
DFT in the following manner [15]

$$d = \mathrm{DFT}^{-1}(\mathrm{DFT}(P) \times \mathrm{DFT}(B)) \qquad (10)$$

where × is a component by component multiply. It can be shown [6, 16] that the
DFT of P, which can be precalculated, is always pure real or pure imaginary, so
that the multiplications which need to be performed involve at worst one real
and one complex value. The calculation of the DFT of B and the inverse DFT in
(10) can be performed by the 4 point DFT algorithm which has already been
defined. A signal flow graph of the entire N = 5 algorithm is shown in figure 15.
This algorithm shows a great deal more regularity than that of Kolba and Parks
which we saw previously, and is thus much more suitable for a pipeline
organization. A circuit which performs this algorithm is shown in figures 16, 17, 18, and
19. The multiplier W6 is most easily realized as a full multiplier circuit.
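The reordering step can be illustrated with a short numerical sketch. It permutes the inputs and the kernel powers with the primitive root g = 2, evaluates the resulting length-4 cyclic correlation directly (rather than through 4-point DFTs as in the circuit described above), and reassembles the 5-point DFT. The sign of the exponent in W is an assumption; the reordering works for either sign. All function and variable names are hypothetical.

```python
import cmath

N, g = 5, 2
W = cmath.exp(2j * cmath.pi / N)                 # assumed sign convention
PERM = [pow(g, k, N) for k in range(N - 1)]      # [1, 2, 4, 3]

def dft5_direct(B):
    """Direct 5-point DFT: A_r = sum_k B_k * W**(r*k)."""
    return [sum(B[k] * W ** (r * k) for k in range(N)) for r in range(N)]

def dft5_rader(B):
    """5-point DFT via Rader's reordering and a length-4 cyclic correlation."""
    M = N - 1
    P = [W ** p for p in PERM]                   # permuted kernel powers
    Bp = [B[p] for p in PERM]                    # permuted inputs B1, B2, B4, B3
    # Cyclic correlation: a[PERM[i]] = sum_j P[(i + j) mod M] * Bp[j]
    a = [sum(P[(i + j) % M] * Bp[j] for j in range(M)) for i in range(M)]
    A = [0j] * N
    A[0] = sum(B)
    for i, p in enumerate(PERM):
        A[p] = B[0] + a[i]
    return A

if __name__ == "__main__":
    B = [1 + 1j, 2 - 0.5j, -1.5 + 0j, 0.75j, 3 - 2j]
    for direct, rader in zip(dft5_direct(B), dft5_rader(B)):
        print(direct, rader)
```

Because the correlation has length N - 1 = 4, the same approach reuses whatever 4-point machinery is already available, which is the property the module design exploits.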


3.4.2.4. Base 7 Module

The  derivation  of  the  base  5  algorithm  above  can  be  used  as  a  prototype  for 
the base 7 algorithm. The primitive root 3 defines the permutation {1,3,2,6,4,5}
which is used to reorder the inputs and outputs. Figures 20, 21, 22, 23, and 24
show the form of the circuit. First the input data is reordered according to the
sequence given above. Then a base 6 DFT is applied. This is just a base 3
followed by a base 2 transform, since 3 and 2 are relatively prime. Next a


Figure 17: First Half of Base 5 Butterfly (a: Block Diagram; b: Macro Symbol)

Figure 18: Last Half of Base 5 Butterfly (a: Block Diagram; b: Macro Symbol)

3.4.2.7. Base 17 Module

Since we have a good base 16 algorithm, base 17 is attractive as well. The
primitive root 3 defines the permutation {1,3,9,10,13,5,15,11,16,14,8,7,4,12,2,6}.
Again only a new reordering circuit is needed as in figures 26 and 27.
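The reorder sequences quoted for the prime bases are simply the successive powers of the primitive root modulo the prime. A minimal sketch that regenerates them (the function name is hypothetical):

```python
def primitive_root_permutation(p, g):
    """Return [g**0 mod p, g**1 mod p, ..., g**(p-2) mod p].

    Raises ValueError if g is not a primitive root of the prime p, i.e. if
    the powers do not visit every integer from 1 to p - 1 exactly once.
    """
    seq = [pow(g, k, p) for k in range(p - 1)]
    if sorted(seq) != list(range(1, p)):
        raise ValueError(f"{g} is not a primitive root of {p}")
    return seq

if __name__ == "__main__":
    print(primitive_root_permutation(5, 2))
    print(primitive_root_permutation(7, 3))
    print(primitive_root_permutation(17, 3))
```

Only this sequence changes from one prime base to the next; the correlation that follows always has length p - 1, which is why a good base 16 algorithm makes base 17 attractive.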

3.4.2.8. Higher Bases

Above  17,  the  Rader/Winograd  form  of  DFT  algorithms  becomes  more 
difficult. 

3.4.2.9. Proposed FFT Cascade

We  have  now  developed  a  number  of  modules  which  can  be  employed  to 
form a full FFT Cascade. The central multiplications of the base 5, 7, 13, and 17
modules can be combined into a single central multiplication. The number of
transform sizes which can be obtained with the restriction that only one
multiplier be used is quite large.

Choose any combination of the $a_i$ such that each $a_i$ is used only once and

Figure 21: First Half Base 7 Butterfly (a: Block Diagram; b: Macro Symbol)

Figure 22: Last Half Base 7 Butterfly (a: Block Diagram; b: Macro Symbol)

$$N = \prod_i a_i = 17,821,440.$$

This  should  be  large  enough  for  most  purposes.  Figure  28  shows  an  FFT  cascade 
of this size, and smaller cascades can easily be derived from this figure. Some
attractive values are

N = 48, 256, 768, 2304, 4352, 13056, 39168.

Note  that  multiple,  multiplexed  channels  of  shorter  transform  length  can  be 
obtained  by  tapping  the  cascade  structure  as  shown. 

3.4.2.10. Module Memory Costs

Each module has two different components to its cost. The first is the
memory cost, which is a function of the position of the module in the pipeline.
Define a factor l that represents the product of all the bases of the modules that
follow. The factor l then represents the number of samples to be stored in the
shortest memory (shift register). For the base 2 modules, 2l memory words
will be required since a single sample has both a real and imaginary part. By
adding up the memory segments from the previous figures, the relative costs of
memory for each type of module can be determined. The results are given in table 1.

Figure 23: Reorder Network for Base 7 Module (a: Circuit Diagram; b: Macro Symbol)

Figure 24: Exchange Circuit Module "E" (a: Circuit Diagram; b: Macro Symbol)

Figure 25: Reorder Network for Base 13 Module (a: Circuit Diagram)

Figure 26: Base 16 Reorder Network (a: Circuit Diagram; b: Macro Symbol)

Figure 27: Base 17 Reorder Network (a: Circuit Diagram)

Figure 28: FFT Cascade for N = 17,821,440

Table 1: Module Memory Costs

Size   Weighting Factor (a)   Cost (al)
 2     1.0                     2.0
 3     2.0                     6.0
 5     2.5                    12.5
 7     2.33                   16.33
13     2.77                   36.0
17     3.75                   63.75

The  second  major  cost  factor  is  fixed  for  each  type  of  module.  This  cost  is 
determined  by  the  number  of  adders  and  2-input  multiplexers,  grouped  under 
the term "Adds". Table 2 summarizes this cost for each module type.


Table 2: Module "Adds" (Central Multiply not included)

Size   Number of "Adds"
 2      4
 3     12
 4     10
 5     20
 7     40
13     64
17     68

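The two cost components can be combined into a rough estimate for a whole cascade. The sketch below is only one possible reading of the tables: it assumes that a module of base r, followed by modules whose bases multiply to l, needs a·r·l memory words (which reproduces the 2l words quoted for base 2 and the cost column of Table 1 at l = 1), and it simply sums the "Adds" of Table 2. The function name and this cost model are assumptions, not part of the report.

```python
# Weighting factors a from Table 1 and "Adds" counts from Table 2,
# keyed by module size (base).
WEIGHT = {2: 1.0, 3: 2.0, 5: 2.5, 7: 2.33, 13: 2.77, 17: 3.75}
ADDS = {2: 4, 3: 12, 4: 10, 5: 20, 7: 40, 13: 64, 17: 68}

def cascade_cost(bases):
    """Return (memory_words, adds) for modules listed from front to back."""
    memory = 0.0
    adds = 0
    for i, r in enumerate(bases):
        l = 1
        for following in bases[i + 1:]:
            l *= following                 # product of the bases that follow
        memory += WEIGHT[r] * r * l        # assumed reading: words = a * r * l
        adds += ADDS[r]
    return memory, adds

if __name__ == "__main__":
    print(cascade_cost([3, 2]))   # base 3 module at the front
    print(cascade_cost([2, 3]))   # base 3 module at the back
```

Under this model the ordering with the base 3 module at the front needs more memory than the reverse ordering, in line with the earlier remark that such a module is costly at the front of a pipeline.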
4. IMPLEMENTATION TECHNOLOGY

Due  to  the  pipeline  organization  of  the  FFT  processor,  it  was  originally 
thought  that  charge  transfer  technologies  such  as  "Bucket  Brigade"  or  "Charge 
Coupled  Devices"  would  be  the  proper  approach  to  take.  We  will  first  give  a  brief 
review  of  the  principle  of  operation  of  charge  transfer  devices  and  of  the  basic 
implementation  of  such  circuitry.  Subsequently,  we  will  discuss  the  difficulties 
in  technology  and  layout  that  arise  in  the  implementation  of  practical  systems. 
Finally, we present the reasons why the charge-transfer approach was abandoned
and standard silicon-gate n-channel MOS technology was favored for the
implementation  of  the  prototype  building  blocks. 

4.1.  Review  of  the  Charge-Transfer  Principle 

"Charge  Transfer  Device"  (CTD)  [3]  is  a  generic  term  which  has  come  to  be 
applied  to  a  family  of  functional  solid-state  electronic  devices  which  includes 
Bucket  Brigade  Devices  (BBD)  and  Charge-Coupled  Devices  (CCD).  Under  the 
application  of  a  proper  sequence  of  clock  pulses,  these  devices  move  quantities 
of  electrical  charge  in  a  controlled  manner  across  a  semiconductor  substrate. 
Using this basic mechanism, they can perform an amazingly wide range of
electronic functions including image sensing, data storage, logic operations, and
signal processing. Because of the shift-register nature of these devices, they are a
natural  match  to  serial  memory  or  to  pipelined  signal  processing  systems. 

There  are  two  basic  approaches  to  forming  charge-transfer  devices.  In 
bucket  brigade  structures  information  is  represented  by  majority  carriers,  e.g. 
the  holes  in  the  p  -type  diffused  regions  constituting  the  source  or  drain  areas 
of  a  p-channel  MOS  transistor  (Fig  29).  Electrically  a  bucket  brigade  device  can 
be  understood  as  a  dynamically  operated  chain  of  pass  transistors.  Under  the 
influence of two clocks half the pass transistors are strongly turned off at any
one time, while the others provide potential barriers that permit the skimming of
the signal charge from the background of majority carriers contained in the source
electrodes. Capacitive coupling of the clocks to the diffused electrodes between
the  pass  gates  will  properly  bias  these  areas  to  make  them  act  as  sources  or 
drains  respectively. 

In  the  charge  coupled  devices,  a  more  sophisticated  electrode  structure  is 
employed  to  create  moving  potential  wells,  that  travel  along  the  surface  of  the 
silicon  crystal.  Information  is  contained  as  a  packet  of  minority  carriers  in 
these moving potential wells. Practically all the charge contained in one
potential well location gets transferred to the subsequent position. Because the signal
charge  does  not  have  to  be  skimmed  off  a  majority  carrier  background,  it  is 
easier  to  obtain  good  "transfer  efficiency". 

The crucial performance parameter in both kinds of devices is 'transfer
inefficiency', a fractional number that indicates what part of the signal charge
fails to get transferred properly and gets mixed into the subsequent signal
packet. Bucket brigade devices have typical transfer inefficiencies of 10^-3 to
10^-4 per stage while CCDs achieve on the order of 10^-5 per transfer. Overall
transfer  inefficiency  of  a  charge  transfer  structure  between  input  and  output  or 
between subsequent signal regenerators should not exceed 50% for digital
applications. This determines the maximum number of stages that can be safely put
into  a  single  charge  transfer  section.  Analog  bucket  brigade  shift  registers  with 
several  hundred  stages  can  be  built  with  acceptable  signal  degradation.  On  the 
other  hand,  CCD  delay  lines  with  up  to  10,000  electrodes  can  be  built  with  good 
performance. 
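The relation between per-transfer inefficiency and the usable length of a charge transfer section can be sketched as follows. The model is a simplifying assumption (each transfer independently leaves behind the same fraction of the packet), and the inefficiency values in the example are illustrative rather than measured data; the function name is hypothetical.

```python
import math

def max_stages(eps, limit=0.5):
    """Largest number of transfers keeping the overall inefficiency <= limit.

    Assumed model: after n transfers a fraction (1 - eps)**n of the charge
    is still in the correct packet, so the overall inefficiency is
    1 - (1 - eps)**n.
    """
    return math.floor(math.log(1.0 - limit) / math.log(1.0 - eps))

if __name__ == "__main__":
    for eps in (1e-3, 1e-4, 1e-5):     # illustrative per-transfer values
        print(eps, max_stages(eps))
```

An inefficiency of one part in a thousand per transfer limits a section to a few hundred stages under this model, while each tenfold improvement extends the usable length roughly tenfold.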

While BBDs are normally implemented with only two clock phases, CCDs
have been built with from 1 to 4 sets of clocked electrodes. Devices with 3 or 4
electrodes per stage can use simple unstructured electrodes, while the 1- or
2-phase devices need to have some structure built into each electrode in order to
uniquely define the direction of charge transfer. The typical means to define
this directionality is to use a step in the thickness of the insulating oxide layer
under the gate electrode or a shallow implant at the surface of the substrate to
produce a suitable asymmetry in the interface potential underneath each
electrode. In both cases the signal charge will then accumulate in the part of the
electrode where it has the lower potential energy and will be prevented from
moving backwards by the barrier part of the potential profile. For this to work,
the amount of signal charge must be restricted to be completely contained
behind the barrier. For the same clock voltages and identical areas of the
storage electrodes, the charge handling of devices with directional electrodes is
thus smaller than that of devices with simple, uniform electrodes.

Figure 29 (a): Schematic rendering of a p-channel BBD with the associated
potential diagram shown in the cross section of the silicon substrate. (b,c):
Potential diagrams shown for various biasing conditions illustrating the
transfer of charge. (d): The corresponding time slots marked in the
diagram of the clock waveforms.

Figure 30 (a): Schematic rendering of a 3-phase n-channel CCD with the
charge carrying potential wells shown in the cross section of the silicon
substrate. (b,c,d): Potential wells shown at subsequent time intervals
illustrating the transfer of charge. (e): The corresponding time slots marked in the
diagram of the waveforms.

Bucket brigade devices have implanted or diffused source drain electrodes
and asymmetrically arranged metal or silicon electrodes that serve
simultaneously as transistor gates and as capacitors that properly bias the source and
drain electrodes. These devices can be built with a single technological gate
level. The area underneath the gap between the gate electrodes is bridged by
the strongly doped source/drain areas (figure 29). Charge coupled devices on
the other hand move minority carriers through lightly doped substrate regions
close to the surface. The potential of all these areas must be carefully
controlled. Inter-electrode gaps lead to unpredictable signal handling and poor
reliability. Thus the whole active channel area must be covered with clock
electrodes. This normally implies the use of at least two conductive levels capable of
providing good MOS gates or the use of special technological tricks such as
selectively doped sheets of high-resistivity polysilicon. The normal CCD
structure thus typically consists of two or more levels of partially overlapping gate
electrodes (figure 30).

Because  of  the  difficulty  of  routing  different  sets  of  clocks  to  all  paths  of  a 
large  charge  transfer  system,  efforts  have  been  made  to  build  CCDs  with  only  a 
single  clocked  electrode  which  covers  the  whole  channel.  It  may  at  first  seem 
surprising,  but  such  structures  are  indeed  possible,  and  experimental  devices 
have been built in several laboratories. However, these structures typically
require a more complicated, very tightly controlled fabrication process, and
provide only very small signal handling capability measured as a fraction of the
applied  clock  amplitudes.  We  are  not  aware  of  any  practical  systems  that  have 
been  built  with  such  uni-phase  CCDs. 

4.2. Implementation Trade-offs

There  are  a  few  fundamental  trade-offs  in  the  construction  of  charge 
transfer  devices.  As  mentioned  above,  signal  handling  can  be  traded  off  versus 
the  number  of  clock  phases.  Uni-phase  devices  can  carry  very  little  charge  per 
volt of the applied clock signals. Two-phase devices have reasonable signal
handling capabilities. From three phases on up the signal handling is very good, but
the  problem  of  routing  all  these  clock  phases  to  the  proper  points  gets  worse 
with  increasing  number  of  clocks.  The  figure  of  merit:  (maximum  signal  charge) 
/  (number  of  clock  phases)  reaches  an  optimum  at  four  phases. 

Similarly  there  is  a  trade-off  in  the  number  of  clock  phases  that  need  to  be 
routed to the charge transfer channel versus the sophistication of the
implementation technology. All practical charge coupled systems need at least two
levels  of  gate  electrodes.  This  is  true  even  for  the  uni-phase  device  because  of 
the  input/output  structures.  In  addition,  the  two-phase  devices  need  at  least 
one  to  two  implants  in  the  area  of  the  transfer  channel  to  provide  the  necessary 
directionality  for  the  electrodes.  Uni-phase  devices  need  at  least  three  to  four 
implants (or corresponding oxide-patterning steps to provide stepped
electrodes). The dosage of these implants has to be very carefully controlled to
guarantee proper operation of the device.

Bucket brigade devices can be constructed with both sets of electrodes
belonging to the two clock phases in a single level of metal or poly-crystalline
silicon.  They  can  thus  be  constructed  with  standard  n-channel  or  p-channel  MOS 
technology.  However,  serial  registers  with  good  transfer  efficiency  are  normally 
much  larger  than  a  corresponding  CCD  implementation. 

In  all  these  devices  there  is  only  a  single  plane  in  which  the  signal  charge 
can  move  around.  The  transfer  of  these  charge  packets  can  occur  only  close  to 
the  silicon  crystal  surface.  Crossing  of  two  signal  paths  is  thus  only  possible  at 
the expense of considerable extra circuitry. Either the charge packets
belonging to the two separate channels are time-multiplexed through the crossing
point,  which  requires  extra  clocks  and  control  gates;  or  at  least  one  of  the  signal 
streams  must  be  taken  out  of  the  charge  domain  and  converted  with  a  sense 
amplifier  to  a  corresponding  voltage.  This  voltage  or  current  signal  can  then  be 
transmitted  in  a  wire  across  the  charge  transfer  channel  containing  the  other 
signal  path.  The  voltage  or  current  signal  can  then  drive  an  injector  circuit  that 
recreates  a  new  charge  packet  of  corresponding  size  and  injects  this  into 
another charge transfer channel. In both cases the extra amount of silicon area
and power required make such signal path crossings quite unattractive.

It  has  often  been  pointed  out  that  VLSI  chips  will  become  ever  more  "wire 
limited".  The  active  devices  themselves  get  smaller  and  faster  because  of  the 
scaling laws that apply to practically all MOS technologies. However, as the
circuits scale down, all wiring will increase in resistance and will contribute in an
ever increasing proportion to the overall delay in the system. In addition, unless
the structure of the overall system layout is planned very carefully, the wiring of
the  system  will  use  an  ever  larger  fraction  of  the  chip  area.  Thus  one  must  give 
preference  to  those  algorithms  that  use  as  few  long  distance  interconnections 
and  global  signals  as  possible.  This  makes  the  one-  and  two-phase  clock  systems 
much  more  attractive. 

Another  serious  limitation  to  the  overall  system  complexity  allowable  on  a 
single  chip  is  power  dissipation.  At  lower  pulse  frequencies  the  NMOS  and  PMOS 
circuits  are  dominated  by  static  power  dissipation.  Equivalent  circuits  could  be 
built with I²L, CMOS or CTD technologies that consume 2 to 4 orders of
magnitude less power. At frequencies above 10 MHz these differences are reduced to
one or two orders of magnitude as shown in figure 31 [17].

Figure 31: Power Dissipation among Various Technologies

4.3. A Technology for VLSI FFT Processors

Early  on  in  the  program  we  studied  the  trade-offs  between  the  various 
implementation  technologies  that  could  be  used  for  the  construction  of  the 
basic building blocks of a fast pipelined FFT VLSI processor. Dobrowolski [18]
compared different implementations of the important butterfly module in
various MOS technologies. The technologies considered, NMOS, PMOS, CMOS, I²L,
and CCD, were compared in terms of active area and layout complexity as well
as power dissipation and speed (through simulation). The key result was that
the charge transfer approach did not look attractive at all for the
implementation of the core logic modules in an FFT processor in which the data is
represented in a parallel digital format. The signal flow graph of the FFT
butterfly module or the CORDIC rotator module contains far too many
topologically unavoidable signal path crossings. This would require that the signal
representation constantly switch from the charge domain to a
voltage/current representation. It is then much more appropriate to
implement the logic blocks using restoring logic with small charge steering networks
of  pass  transistors  interspersed,  both  of  which  can  be  fabricated  using  standard 
MOS  technology. 

Even for the implementation of serial memory, charge transfer devices do
not look very attractive anymore. For small blocks of memory the possible
savings in area and power dissipation compared to almost any dynamic or static
memory  block  are  negligible  since  the  overhead  of  the  relatively  complicated 
peripheral  control  circuitry  dominates.  In  this  case  then  one  would  prefer  to 
use  a  type  of  memory  that  can  be  readily  fabricated  with  the  same  technology 
as  is  used  for  the  logic  modules  for  easy  integration  of  the  whole  system.  If  the 
memory  block  has  to  be  fairly  large,  then  power  dissipation  becomes  a  crucial 
issue.  A  purely  serial  memory  would  be  unacceptable  since  the  power  would  be 
proportional  to  the  number  of  bits  moved,  rather  than  the  number  of  bits 
stored.  Since  in  a  purely  serial  memory  all  bits  move  in  every  clock  cycle,  the 
power  consumption  can  become  prohibitive. 

Most  tricks  that  have  permitted  the  charge  coupled  memories  to  reach 
rather  high  bit  densities  have  now  been  adopted  by  the  designers  of  the  large 
dynamic  RAMs  as  well,  so  that  the  density  advantage  of  CCDs  no  longer  outweighs 
the  more  difficult  fabrication  process. 

Based  on  this  comparative  analysis  we  have  decided  to  concentrate  on  the 
readily  available  NMOS  technology  for  the  implementation  of  the  prototypes  of 
the logic modules needed in a VLSI FFT processor.

5. DESCRIPTION OF INTEGRATED CIRCUITS
5.1.  16-Point  DFT  Processor 

In  [2],  Despain  describes  a  pipeline  processor  for  computing  a  16-point 
DFT.  This  processor,  shown  in  figure  32,  consists  of  four  basic  modules,  one 
which  computes  the  butterfly  operation  of  the  FFT,  and  three  which  perform 
various  vector  rotations. 


Figure  32:  16  point  DFT  processor 

The π/16 and π/8 rotators work by rational approximation as described in [2].
Of these two, only one, the π/16, was actually implemented, although a more
general vector rotator which is capable of rotating a vector by any angle was
also designed, and could take the place of the collection of π/n modules in a
processor design. The more general vector rotator could also be used to build
processors for computing transforms of much larger size as described earlier in
this paper. The π/2 rotation, since trivial, was included on the butterfly module
chip.

5.2. System-wide Considerations
5.2.1.  Bit  Skewing 

All  data  words  (16  bit  integers)  in  the  processor  are  skewed  bitwise  so  that 
the  least  significant  bit  of  the  word  arrives  at  the  chip  one  clock  cycle  before  the 
next  least  significant  bit  and  so  on.  This  allows  the  carry  from  the  addition  of 
one pair of bits to propagate while the next most significant pair of bits is
arriving. Thus, the irregularity of full carry-lookahead adders is avoided, and a chain
of simple, one-bit carry-save adders can be used. The effect of this is to
increase the throughput of the pipeline processor, since one can make the clock
cycle equal to the time necessary to perform a one bit addition instead of a 16
bit addition. However, latency is increased for several reasons. First, there is
the obvious first-order effect due to the fact that 16 clock cycles are required to
input or output one datum. Clearly, this would be negligible in any
signal-processing application. The more serious effect is due to the fact that shift and
add operations, which comprise the whole of the CORDIC algorithm, introduce a
latency equal in size to the magnitude of the shift. For example, if bit 3 of a
data word is to be added to bit 7 of the same data word, bit 3 must be stored in a
register until bit 7 arrives four clock cycles later. In an n-bit CORDIC vector
rotator, the latency due to this effect would be

$$\text{Shift latency} = \sum_{i=0}^{n-1} i = \frac{n(n-1)}{2}$$

where n is the number of bits of accuracy of the CORDIC operation. The
registers which are necessary for this intermediate storage also increase the
area of the chip by a significant amount. In fact, in the 16-bit CORDIC rotator,
these registers accounted for 60% of the active area of the datapath.
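
As a rough check on the shift latency, the following sketch sums the per-stage
waiting times of a shift-and-add pipeline operating on bit-skewed data. It
assumes that stage k (k = 1..n) shifts by k-1 bit positions, which matches the
$2^{-k+1}$ shifts of the CORDIC rotator described later in this report; the
function names are illustrative only and do not come from the original design.

```python
def wait_cycles(low_bit: int, high_bit: int) -> int:
    """Cycles a lower-order bit must be held until the higher-order bit arrives.

    With bit-skewed data, bit b arrives one clock after bit b-1, so pairing
    bit `low_bit` with bit `high_bit` costs (high_bit - low_bit) cycles.
    """
    if high_bit < low_bit:
        raise ValueError("high_bit must not be below low_bit")
    return high_bit - low_bit


def shift_latency(n_stages: int) -> int:
    """Total latency (in clocks) contributed by the shifts alone.

    Assumption: stage k, for k = 1..n_stages, shifts by k - 1 positions.
    """
    return sum(wait_cycles(0, k - 1) for k in range(1, n_stages + 1))


if __name__ == "__main__":
    n = 16
    print("bit 3 waits for bit 7:", wait_cycles(3, 7), "cycles")
    print("shift latency for n = 16:", shift_latency(n), "cycles")
    print("closed form n(n-1)/2:", n * (n - 1) // 2)
```

Under this assumption the shift latency grows quadratically with the word
length, which is why the delay registers take up such a large share of the
datapath.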




Figure  33:  Block  Diagram  of  Root  3  Circuit 

If the high latency or the increased area of the bit-skewing technique were
a problem, a tradeoff can be made by skewing the data words in blocks of m
bits, where m<n. This relieves both problems at the expense of reducing
throughput, since now the basic clock cycle must be on the order of the time
necessary to perform an m-bit addition, unless the adder itself is pipelined.

5.2.2.  Multiplexing  of  the  Real  and  Imaginary  Parts

It was also decided at an early stage to multiplex the real and imaginary
parts of the complex data vectors through the same pins. Although this reduces
the throughput, it would have been impossible to build the butterfly chip
any other way, since the number of pins this chip uses is right at our current
limit of 84. Also, it reduced the complexity of the crossover problem a great
deal, especially in the CORDIC chip and the $\pi/16$ rotator, which otherwise
would have had to have global chip communications at every stage of the
algorithm. Rotation by $\pi/2$, used at the front end of the butterfly chip,
became trivial, since it merely entailed the use of a buffer to reorder the real
and imaginary parts of the data, whereas global communications would have been
required if the data paths for the two parts had been separate.


Figure  34:  16  bit  Root  3  Circuit 

5.3.  Root  3  Circuit

In the prime factor algorithm of Good [5] large transforms are built up
from smaller, relatively prime factors. The advantage of the technique is that
no twiddle factors are necessary as in the Cooley-Tukey algorithm, although the
complexity of data rearrangement is much higher. In [2] Despain suggested an
algorithm for performing the base-3 DFT which could be used in conjunction with
a radix-2 FFT processor to handle transform sizes of the form $3 \times 2^n$
without the need for twiddle factors. One of the basic computations of the
base-3 DFT is a multiplication by $\sqrt{3}$, which can be performed by the use
of a rational approximation. If one is willing to accept an arbitrary gain
factor in the result of the DFT, one can then multiply the entire DFT equation
by the denominator of the rational approximation, thus limiting the necessary
computations to constant real multiplies. For sixteen-bit accuracy, a good
approximation is given by $\sqrt{3} \approx 265/153$.


Figure 35: Fabricated Root 3 Chip

The use of this approximation also minimizes the number of shifts and adds
necessary to perform the operation, since a multiplication by

$$265 = (2^5+1)\,2^3 + 1$$

requires 2 shifts and 2 adds, and a multiplication by

$$153 = (2^4+1)\,2^3 + (2^4+1)$$

also requires 2 shifts and 2 adds. In addition, it is easy to build hardware
capable of performing both multiplications.
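
The two decompositions can be exercised directly. The sketch below is a plain
software model, not a description of the chip's bit-sliced datapath: each
constant multiplication is written with exactly two shifts and two adds, and
the ratio of the two constants is compared with $\sqrt{3}$. The function names
are hypothetical.

```python
import math


def times_265(x: int) -> int:
    """Multiply by 265 = (2^5 + 1) * 2^3 + 1 using two shifts and two adds."""
    t = (x << 5) + x      # first shift and add: 33 * x
    return (t << 3) + x   # second shift and add: 264 * x + x


def times_153(x: int) -> int:
    """Multiply by 153 = (2^4 + 1) * 2^3 + (2^4 + 1), two shifts and two adds."""
    t = (x << 4) + x      # first shift and add: 17 * x
    return (t << 3) + t   # second shift and add: 136 * x + 17 * x


if __name__ == "__main__":
    x = 1000
    print(times_265(x), times_153(x))
    print("265/153 =", 265 / 153, " sqrt(3) =", math.sqrt(3))
    print("absolute error:", abs(265 / 153 - math.sqrt(3)))
```

Both routines share the pattern "shift, add, shift, add", which is what makes
it straightforward for one piece of hardware to serve both constants.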

Figure 33 is a detailed block diagram of the chip as it was actually
implemented. The blocks marked SR shift a datum right one bit, while the blocks
marked ADDER are one-bit adders. Due to area limitations, the chip was realized
as a bit-slice, requiring three chips for the full 16 bits of precision as shown
in figure 34. The input data are applied on pins DI1-DI8 and the output data are
received on pins DO1-DO8. The outputs SO1-SO8 and CO1-CO2 would be connected
directly from the first (second) chip in the cascade to the inputs SI1-SI8
and CI1-CI2 on the second (third) chip. When the input SEL is high, the chip
multiplies by 265, and when SEL is low, the chip multiplies the input data by
153. Unfortunately, since the chip was designed before the bit-skewed data
format was decided upon, it is incompatible with the later chips which used that
format.

Figure 36: Schematic of Barrel Shifter

The fabricated chip, shown in figure 35, was tested and found to be
operational up to 8 MHz, which was the limit of our test equipment at the time.
The power consumption was measured to be 32 mA quiescent and 60 mA at 6 MHz.

5.4.  Barrel  Shifter 

The use of the CORDIC algorithm for vector rotation in the computation of
the DFT has already been discussed. Hardware capable of performing this
algorithm has as one of its basic building blocks a suitable shift network. A
preliminary study of a programmable barrel shifter capable of left shifts of
arbitrary size was done and an 8-bit version was designed and fabricated. A
schematic of this circuit is shown in figure 36. The chip has 8 data inputs, 3
control inputs which specify the number of bits by which the data word is to be
shifted, and 15 data outputs. The input data enter the chip on the lines marked
A0-A7 in figure 36 and pass into the array of "C" cells seen on the right side.
Each of these cells is capable of sending a datum straight through or shifting
it to the left depending on the state of the S0-S2 control signals.
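
The behavior of such a shifter can be modeled in a few lines. The sketch below
assumes a logarithmic arrangement in which S0, S1 and S2 enable left shifts of
1, 2 and 4 positions respectively; the text does not state how the "C" cell
array is organized internally, so this staging is an assumption, and the
function name is hypothetical. Only the external behavior (8 inputs, 3 control
bits, 15 outputs) is taken from the description above.

```python
def barrel_shift_left(data: int, s2: int, s1: int, s0: int) -> int:
    """Behavioral model of an 8-bit-in, 15-bit-out left barrel shifter.

    Each control bit either passes the word straight through or shifts it
    left by a fixed power of two (assumed: S0 -> 1, S1 -> 2, S2 -> 4).
    """
    if not 0 <= data <= 0xFF:
        raise ValueError("data must be an 8-bit value")
    word = data
    for amount, select in ((1, s0), (2, s1), (4, s2)):
        if select:
            word <<= amount   # shift left; zeros fill the vacated low bits
    return word               # at most 0xFF << 7, which fits in 15 bits


if __name__ == "__main__":
    for shift in range(8):
        s2, s1, s0 = (shift >> 2) & 1, (shift >> 1) & 1, shift & 1
        print(shift, format(barrel_shift_left(0b10110001, s2, s1, s0), "015b"))
```

The 15 outputs follow from the widest case: an 8-bit word shifted left by 7
positions occupies bits 7 through 14.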

The finished chip (shown in figure 37) was tested and found to be
operational at clock rates up to 10.4 MHz.


Figure 38: Block Diagram of $\pi/16$ Circuit

In the final version of the CORDIC rotator which is discussed below, the
shifts were hardwired rather than handled by programmable shifters at each
stage. However, a need for a programmable shifter would arise in a lower
performance iterative CORDIC unit which used the same hardware to process all
the stages of the CORDIC algorithm.


5.5.1.  Theory  of  operation

In [2] Despain discusses the use of rational approximations for rotations.
In particular, algorithms for $\pi/16$ and $\pi/8$ rotations are developed which
are optimum in the sense that they reduce the number of additions necessary to
achieve the accuracy desired. The algorithm which was implemented was for a
$\pi/16$ rotation and can be written as

$$r_1 = (2^{-2}+1)\,r_0$$
$$i_1 = (2^{-2}+1)\,i_0$$
$$r_2 = r_1 \mp 2^{-2}\,i_0$$
$$i_2 = i_1 \pm 2^{-2}\,r_0$$

followed by

$$r_3 = r_2 \pm 2^{-10}\,i_2$$
$$i_3 = i_2 \mp 2^{-10}\,r_2$$

where $r_0$ and $i_0$ are the real and imaginary parts, respectively, of the
vector to be rotated. This algorithm is similar to the full CORDIC algorithm in
that it consists only of shifts and adds.
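
A short numerical check helps to see why three shift-and-add stages suffice.
The sketch below is floating-point, not the chip's integer datapath, and it
rests on one reading of the equations that should be treated as an assumption:
the second-stage cross terms are taken from the unscaled inputs $r_0$ and
$i_0$, so that the first two stages rotate by $\arctan(0.25/1.25) =
\arctan(0.2)$, which overshoots $\pi/16$ by roughly $2^{-10}$ radians, and the
third stage rotates back by about that amount. The function name is
hypothetical.

```python
import math


def rotate_pi_16(r0: float, i0: float, sign: int = 1):
    """Approximate (scaled) rotation by sign * pi/16 using shifts and adds only."""
    # Stage 1: scale both components by (1 + 2^-2).
    r1 = r0 + r0 / 4
    i1 = i0 + i0 / 4
    # Stage 2: add the 2^-2 cross terms (assumed to use the unscaled inputs).
    r2 = r1 - sign * i0 / 4
    i2 = i1 + sign * r0 / 4
    # Stage 3: small opposite rotation with tangent 2^-10.
    r3 = r2 + sign * i2 / 1024
    i3 = i2 - sign * r2 / 1024
    return r3, i3


if __name__ == "__main__":
    r, i = rotate_pi_16(1.0, 0.0)
    print("achieved angle:", math.atan2(i, r))
    print("target pi/16  :", math.pi / 16)
    print("gain          :", math.hypot(r, i))
```

The result carries a constant gain, which is acceptable as long as every path
through the processor is scaled consistently.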

Figure 38 and figure 39 show a block diagram and CIFPLOT of the completed
circuit. The layout and function of the chip are very similar to those of the
CORDIC rotator, and thus will not be explained in great detail. In fact, the
major difference is that this algorithm consists of only three stages, whereas
the full CORDIC algorithm requires sixteen for the same accuracy. This forced a
different aspect ratio on the basic cells to avoid a tall and narrow chip.




Figure  40:  Block  Diagram  of  CORDIC  rotator 

5.6.  CORDIC  Rotator  Chip
5.6.1.  Theory  of  Operation

The chip was specified to work with 16-bit two's complement data words,
which set the number of iterations to 17. Since, in a Fourier Transform
processor, the rotation angles are known, we have assumed that the $a_k$ have
been previously computed and are delivered to the CORDIC rotator chip by the
control circuitry (probably a ROM). Given a complex input vector $r_0 + j i_0$,
the algorithm can now be expressed as


Figure  41:  Floor  Plan  of  CORDIC  rotator 
IF a_0 = 0 THEN DO

    r_1 ← -i_0
    i_1 ← r_0

ELSE IF a_0 = 1 THEN DO

    r_1 ← i_0
    i_1 ← -r_0

FOR k ← 1..16 DO

    IF a_k = 0 THEN DO

        r_{k+1} ← r_k + i_k 2^{-k+1}        (3)
        i_{k+1} ← i_k - r_k 2^{-k+1}

    ELSE IF a_k = 1 THEN DO

        r_{k+1} ← r_k - i_k 2^{-k+1}
        i_{k+1} ← i_k + r_k 2^{-k+1}

where the $a_k$ are externally supplied control signals which determine the
order of addition and subtraction at each stage. The $k=0$ stage is a
$\pm\pi/2$ rotation which is necessary if one wishes to rotate by angles from
$+\pi$ to $-\pi$.
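
The algorithm can be tried out in software. The following is a floating-point
sketch under stated assumptions, not a model of the chip's 20-bit integer
datapath: `choose_controls` is a hypothetical helper standing in for the ROM
that supplies the $a_k$, using the usual greedy CORDIC choice, and the sign
convention (a control bit of 0 at stage 0 gives $+\pi/2$, a control bit of 0 at
later stages gives a negative micro-rotation) follows the update rules as
written in (3).

```python
import math

N_ITER = 16  # stages k = 1..16, in addition to the k = 0 stage


def choose_controls(theta: float) -> list:
    """Pick a_0..a_16 for a rotation by theta, with -pi <= theta <= pi."""
    a = [0] * (N_ITER + 1)
    if theta >= 0:
        a[0], residual = 0, theta - math.pi / 2   # stage 0 rotates by +pi/2
    else:
        a[0], residual = 1, theta + math.pi / 2   # stage 0 rotates by -pi/2
    for k in range(1, N_ITER + 1):
        step = math.atan(2.0 ** (-k + 1))
        if residual >= 0:
            a[k], residual = 1, residual - step   # a_k = 1: positive step
        else:
            a[k], residual = 0, residual + step   # a_k = 0: negative step
    return a


def cordic_rotate(r: float, i: float, a: list) -> tuple:
    """Apply the stage-0 swap and the 16 shift-and-add stages of (3)."""
    r, i = (-i, r) if a[0] == 0 else (i, -r)
    for k in range(1, N_ITER + 1):
        shift = 2.0 ** (-k + 1)
        if a[k] == 0:
            r, i = r + i * shift, i - r * shift
        else:
            r, i = r - i * shift, i + r * shift
    return r, i


if __name__ == "__main__":
    theta = 1.0
    r, i = cordic_rotate(1.0, 0.0, choose_controls(theta))
    print("angle:", math.atan2(i, r), " gain:", math.hypot(r, i))
```

Every stage is applied regardless of the angle, so the gain is the same
constant for all rotations and can be absorbed into the overall scaling of the
transform.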

Since the chip was specified to work in a pipeline DFT processor, it was
implemented as a pipeline, with each iteration being handled by a separate
piece of hardware. This allows the multiplications by $2^{-k+1}$ to be performed
by hardwired shifts between each iteration stage. Also, due to the bit skewing
throughout the DFT processor, we have not used full carry-lookahead adders,
but have utilized one-bit carry-save adders which allow the carries to propagate
before they are needed. The operation of a set of these carry-save adders will
be explained in detail below. This turned out to have an advantage in that the
regularity of the entire chip was greatly increased, reducing the design time
considerably.

Looking at the block diagram in figure 40 and the floor plan in figure 41, one
can see that the data comes in on the left and flows through the 17 iterations
of the CORDIC algorithm. The two-phase clocks used by the chip are assumed to be
generated and driven by circuitry off chip (since they run through the entire
DFT processor). On the chip, the clocks run vertically across the entire circuit
in metal along with power and ground. At the top of the block diagram are the
$a_k$ control signals. As one can see from the algorithm, the $a_k$ determine
whether the shifted half of the vector is added to or subtracted from the other
half at each stage. Since, for any given input vector, all 17 of the $a_k$ are
input at the same time, each $a_k$ must be delayed so that it will reach the
stage it is to control at the same time as the data. These delays are
accomplished by entering each $a_k$ into simple shift registers (pass
transistors and inverters) of the proper length. The reordering buffers and the
reordering buffer control are discussed below.

5.6.2.  Detailed  Description  of  Data  Path  and  Control 

We will now look closely at the computation of one iteration of the algorithm
(stage 5) as shown in figure 42. The notation $A_k(b)$ signifies an
adder/subtractor in stage $k$ which handles bit $b$ of the data word. This
module is shown in detail in figure 43. Note that $b$ takes values from -4 to
15, since we keep 4 guard bits in the partial results as recommended by Walther
[19]. The input to the adder marked ± will be added to or subtracted from the
input marked + depending on the value of the control input $p_k(b)$, where $k$
and $b$ again denote the stage and bit, respectively. Since we are working in
two's complement, subtraction is performed by complementing the ± input and
holding the carry-in, $c_k(b)$, high. This is realized by connecting $c_5(-4)$
to $p_5(-4)$, since a high value on $p_k(b)$ signifies subtraction. Note that
the signal $p_5(-4)$ is merely the $a_5$ of the algorithm which has been delayed
as mentioned in the previous section. In addition to the sum output which
appears on the right, each adder also produces two control signals which are
passed on to the next adder in the chain. One of these is the $p_k(b)$ input
delayed by one clock cycle, and the other is the carry-out resulting from the
addition. We will also use the notation $r_k(b)$ and $i_k(b)$ to denote bit $b$
of the real and imaginary parts of the data, respectively, at stage $k$.
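
The add/subtract mechanism described here (complement the ± input and force the
carry-in of the least significant bit high) can be illustrated with a
bit-serial model. The sketch below processes one bit per loop iteration,
least significant bit first, to mimic the skewed arrival of the data; the
20-bit width reflects the 16 data bits plus 4 guard bits mentioned above. The
helper names are hypothetical and the model ignores clock phases.

```python
WIDTH = 20  # 16 data bits plus 4 guard bits


def to_bits(value: int, width: int = WIDTH) -> list:
    """Two's complement bits of value, least significant bit first."""
    return [(value >> b) & 1 for b in range(width)]


def from_bits(bits: list) -> int:
    """Inverse of to_bits: interpret an LSB-first list as two's complement."""
    value = sum(bit << pos for pos, bit in enumerate(bits))
    if bits[-1]:
        value -= 1 << len(bits)
    return value


def serial_add_sub(plus_bits: list, pm_bits: list, subtract: bool) -> list:
    """Chain of one-bit adders; `subtract` plays the role of the p control."""
    carry = 1 if subtract else 0          # carry-in of the LSB tied to p
    out = []
    for x, y in zip(plus_bits, pm_bits):  # one bit position per clock
        if subtract:
            y ^= 1                        # complement the +/- input
        total = x + y + carry
        out.append(total & 1)             # sum bit leaves the adder
        carry = total >> 1                # carry passed to the next adder
    return out


if __name__ == "__main__":
    x, y = 1234, -567
    print(from_bits(serial_add_sub(to_bits(x), to_bits(y), subtract=False)))
    print(from_bits(serial_add_sub(to_bits(x), to_bits(y), subtract=True)))
```

Because each carry is only needed one bit position later, it has a full clock
cycle to settle, which is the property the bit-skewed format exploits.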

The operation of this stage is as follows. During phi2, $A_4(-4)$ produces the
output $r_5(-4)$, which is entered into the shift register $SR_5(-4)$ during
phi1. At the next phi2, the same adder will produce $i_5(-4)$, which will go
into the same shift register. By the algorithm in (3), we can derive the
following:

$$r_6(-4) = r_5(-4) \pm i_5(0) \qquad (4)$$
$$i_6(-4) = i_5(-4) \mp r_5(0)$$

which correspond to a shift of $2^{-4}$ and an add/subtract operation. Since
the bits of the data flowing through the chip are skewed so that the higher
order bits trail the lower order bits, the outputs $r_5(0)$ and $i_5(0)$ from
$A_4(0)$ will be computed 4 clock cycles after the outputs of $A_4(-4)$ in (4).
Thus, the outputs of $A_4(-4)$ need to be delayed for at least 4 clock cycles to
allow the outputs of $A_4(0)$ to catch up. However, note that the outputs of
$A_4(0)$ are not in the correct order for (4) to be computed. This necessitates
the use of a reordering buffer (denoted by $R_5(-4)$), which inputs $r_5(0)$ on
phi1, stores it in a shift register, allows $i_5(0)$ to pass undelayed on the
next phi1, then allows $r_5(0)$ to be input to $A_5(-4)$ during the following
phi1. The control of the reordering buffer will be discussed later. Note that
the added delay in this module requires that the outputs of $A_4(-4)$ be delayed
by 5 clock cycles altogether before being entered at the next phi1 into
$A_5(-4)$. An identical sequence of events occurs for the two data inputs to
$A_5(-3)$, except that the entire sequence occurs one clock cycle later. In
addition, $A_5(-4)$ holds onto its carry-out until phi1, when it is passed to
$A_5(-3)$ along with the delayed $p_5(-4)$ control signal. At the bottom of the
figure, the modules labeled $S_k(b)$ perform the sign-extension necessary for
two's complement arithmetic by delaying (by shift register) the output of
$R_5(11)$ by one clock cycle each. Since the output of this reordering buffer is
the sign bit of the previous stage's output, this is the correct operation.


Figure 42: Stage 5 of the CORDIC Rotator


Figure 43: Adder Module

In general, stage $k$ in the iteration consists of a shift by $2^{-k+1}$,
$20-k+1$ reordering buffers, a set of delays of length $k$, $k-1$
sign-extension delays, and a set of 20 adder/subtractor modules.

The control circuitry for the chip consists of the delays for the $a_k$
(described earlier) and the control circuit for the reordering buffers. The chip
requires that the user supply a high level to the input REALIN whenever the
least significant bit of the input data is from the real part of an input
vector, and a low level whenever it is from the imaginary part of an input
vector. This signal has two functions. First, it inverts the $a_k$ input
whenever an imaginary number is being input, thus deriving the two $p_k(b)$
signals necessary for the addition and subtraction in each iteration of the
algorithm. The circuit which performs this inversion is shown in figure 44.

Figure 44: Inversion of $a_k$

Second, it is entered into a shift register from which the control signals to
the reordering buffers are tapped as shown in figure 45. When the appropriate
stage of the shift register contains a high level, the signal rx is produced
which causes the reordering buffer to store its present input as the real part
of the data vector. When this level is passed to the next stage of the shift
register, producing the signal im, the data at the input of the reordering
buffer must be the imaginary part of the data vector, and is thus allowed
through without delay.


Figure  45:  ROB  and  ROB  Control  Signals 

Similarly, the signal ra causes the previously stored real part to be sent to
the adder as explained above. When the REALIN signal has propagated the length
of the shift register, it is brought off the chip as REALOUT to signal the fact
that the least significant bit of the output data is real. This allows the chips
in the DFT processor, which all use the same data format, to be cascaded with no
extra circuitry.

A CIFPLOT of the finished chip is shown in figure 46.

5.6.3.  Performance  Estimation 

Since the chip is designed as a pipeline, the performance is limited by the
longest delay between any two dynamic registers. The longest delay in the chip
occurs in the adder module, which contains the only substantial combinatorial
logic on the chip. A SPICE run on this circuit reported a worst case delay from
input to carry-out of approximately 40 ns. The only significant wiring which
might affect the circuit's performance exists between the output of the $a_k$
delays and the adders, but this was hand estimated to be less than 40 ns, and so
would not reduce the clock rate further. Thus, the overall maximum clock rate
should be expected to be 10 to 12 MHz.


Figure 46: CIFPLOT of CORDIC Rotator Chip

5.6.4.  Testing

Due to area limitations, there are no registers between the stages of the
pipe and thus it is impossible to look at intermediate results of the CORDIC
calculation on a fabricated chip. However, it would be possible to add a
register at the output of each adder module and string these together if one is
willing to give up area to do it. One could turn off phi2 at the output of the
adder and enable a pass transistor which would allow the sum output to be
chained to the next adder using a shift register stage. Clearly, this would
require at least two pass transistors and two inverters for each signal one
wished to chain together in this way, along with at least one more control line
running down the adder chain. Since this would easily take the chip past 10 mm
in length, it was not implemented.

5.7.  Butterfly  Circuits 

5.7.1.  Introduction 

The butterfly computation takes two complex inputs $A_k$ and $B_k$ and
computes the outputs $C_k$ and $D_k$ according to the rule

$$C_k = A_k + B_k \qquad (1)$$
$$D_k = A_k - B_k$$

where $A_k$, $B_k$, $C_k$, and $D_k$ are complex.

A preliminary study of various technologies for implementing the butterfly
chip was undertaken early on in the research. A 4-bit-slice NMOS chip was built
at that time to test the speed of the adder that had been designed while
verifying simulation results. This chip did not use the multiplexed real and
imaginary format and is thus incompatible with the CORDIC rotator without the
use of extensive support circuitry. However, another NMOS chip was later
designed which did use this format and could be used in conjunction with the
CORDIC rotator to build FFT processors of arbitrary size.

Following the structure of the DFT processor in [2], both chips were
designed to operate in the following way. First, the input data ($A_k$ in (1))
is passed into a shift register until it is full. At this point, the chip is
switched into add mode and the data stored in the shift register is combined
with the incoming data ($B_k$) to form the sums and differences according to
(1). The sum output ($C_k$) is sent immediately through the output pins to the
next stage of the processor, while the difference outputs ($D_k$) are stored in
the vacant stages of the shift register. When the $C_k$ have all been passed
out, the shift register is then emptied through the output pins by switching the
chip back to pass mode.
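
The pass/add sequencing can be followed with a small behavioral model. The
sketch below treats each complex sample as a single value and uses a queue for
the shift register; it ignores the bit-serial format and the real/imaginary
multiplexing, and the function name is hypothetical.

```python
from collections import deque


def butterfly_stream(a_block: list, b_block: list) -> list:
    """Return the output stream for one block: all C_k, then all D_k."""
    if len(a_block) != len(b_block):
        raise ValueError("A and B blocks must have the same length")
    shift_reg = deque()
    out = []

    # Pass mode: the A_k fill the shift register.
    for a in a_block:
        shift_reg.append(a)

    # Add mode: each incoming B_k meets the oldest stored A_k.
    for b in b_block:
        a = shift_reg.popleft()
        out.append(a + b)          # C_k goes straight to the output pins
        shift_reg.append(a - b)    # D_k takes the vacated register stage

    # Pass mode again: the stored D_k are shifted out.
    while shift_reg:
        out.append(shift_reg.popleft())
    return out


if __name__ == "__main__":
    print(butterfly_stream([1 + 2j, 3], [0.5j, 1]))
```

In the real chip the final pass phase overlaps with loading the next block of
$A_k$, so the shift register is never idle; the model above keeps the phases
separate for clarity.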

5.7.2.  Preliminary  Butterfly  Processor

The preliminary butterfly chip consisted of a set of four 1-bit
adder/subtractor units which could be used to build up arbitrarily large
butterfly processors. A block diagram of a 16-bit butterfly processor built from
these modules is seen in figure 47 along with a single module whose inputs and
outputs are shown. A and B are the data inputs, C and D are the data outputs,
Ci, Co, Bi, Bo the carry and borrow inputs and outputs from the other chips in
the cascade, and Mi and Mo the mode input and output which determine whether
the circuit is in pass or add mode as described above. The circuit itself was
implemented with a NOR PLA, since it is a regular structure which can be
made fairly compact and fast.


Figure  47:  16  Bit  Butterfly  Processor 

The resulting 1-bit module, which can be seen replicated four times in the
finished chip (figure 48), measured 420 by 657 μm, with λ = 2.5 μm. A ring
oscillator circuit was set up to measure the PLA delay, yielding a result of
120 ns. Note that this is quite a bit slower than the circuit which was later
devised for the CORDIC rotator. Power consumption was measured to be 15 mW per
PLA with Vdd = 5 V.

5.7.3.  Compatible  Butterfly  Processor

For convenience, it was decided to include a $\pi/2$ rotator at the front end
of this chip, which is realized by a circuit which performs

$$\mathrm{Re}(B_k') = \pm\,\mathrm{Im}(B_k) \qquad (2)$$
$$\mathrm{Im}(B_k') = \mp\,\mathrm{Re}(B_k)$$

where $B_k'$ denotes the rotated $B_k$. Also, enough shift register memory was
included on chip so that a 16-point DFT processor could be built without the
need for any external memory. This memory can be bypassed for larger transforms.


Figure 49: Floorplan of Butterfly Module

Referring to the floorplan in figure 49 and the schematic in figure 50, the
chip is utilized as follows.


Figure  50:  Schematic  of  Butterfly  Module

1.  An internal or external shift register is chosen through the use of the
input "extrdel". When "extrdel" is high, the internal shift register is
disabled and an external register can be connected to the pins Tout and Tin.

2.  If the internal shift register is chosen, the length must be set through
the use of inputs del0 through del4. Table 3 shows the settings required for the
length of shift register desired.


Table  3:  Choosing  Shift  Register  Length 


Extrdel | Del0 | Del1 | Del2 | Del3 | Del4 | DELAY


3.  The  input  "pass/add"  is  set  high  (pass). 

4.  The $A_k$ are entered sequentially, real part first, then imaginary part.
The input "realin" must be high when entering the least significant bit of the
real part and low when entering the least significant bit of the imaginary
part.

5.  When the shift register is full, the input "pass/add" is set low and the
$B_k$ are entered in the same way as the $A_k$, except that one must also set
the inputs "w0" and "w1" to determine the rotation angle desired. These inputs
must be stable during the input of the least significant bits of both the real
and imaginary parts of $B_k$. The relationship of these inputs to the rotation
angle is shown in table 4.


Table 4: Choosing rotation angle

w1 | w0 | angle
0  | 0  | 0
0  | 1  | $\pi/2$
1  | 0  | $\pi$
1  | 1  | $-\pi/2$

The input "realin" must still be sequenced correctly.

6.  After all the $B_k$ have been entered, the "pass/add" input is set high and
the $D_k$ are output. The output "realout", when high, signals the real part of
an output datum. The next set of $A_k$ can be entered simultaneously with this,
and steps 5-6 can be repeated indefinitely.

The $\pi/2$ rotator is implemented as recommended by Despain in [2]. If the
$B_k$ are to be rotated by $\pm\pi/2$ (determined by "w0"), the real and
imaginary parts are interchanged in the reordering buffer (ROB), as is required
by (2). The change of sign in (2) is accomplished by merely inverting the plus
or minus control signal to the adder/subtractors. If the $B_k$ are to be rotated
by $\pi$, the plus or minus control signals are inverted without interchanging
the real and imaginary parts. If the $B_k$ are to be rotated by an angle of 0,
nothing is done.
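
The four trivial rotations reduce to an optional swap and optional sign
changes, which the sketch below makes explicit. It assumes that a positive
angle means multiplication by $e^{+j\theta}$; the text does not state the sign
convention, so the roles of $+\pi/2$ and $-\pi/2$ may be exchanged on the
actual chip. In hardware the sign changes are folded into the add/subtract
control rather than applied as separate negations. The names used are
hypothetical.

```python
import cmath
import math

# Rotation angle selected by (w1, w0), following table 4.
ANGLES = {(0, 0): 0.0, (0, 1): math.pi / 2, (1, 0): math.pi, (1, 1): -math.pi / 2}


def trivial_rotate(re: float, im: float, w1: int, w0: int) -> tuple:
    """Rotate (re, im) by a multiple of pi/2 using only a swap and negations."""
    if w0:                       # +/- pi/2: interchange real and imaginary parts
        re, im = im, re
    if w1 ^ w0:                  # real part changes sign for pi/2 and pi
        re = -re
    if w1:                       # imaginary part changes sign for pi and -pi/2
        im = -im
    return re, im


if __name__ == "__main__":
    for (w1, w0), angle in ANGLES.items():
        print(w1, w0, trivial_rotate(3.0, 4.0, w1, w0), angle)
```

Each case can be compared against a complex multiplication by
`cmath.exp(1j * angle)` to confirm that no arithmetic beyond sign changes is
needed.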

5.7.4.  Chip  Description 

The size of the chip is 3.03 × 2.70 mm, which was determined by the
large number of pads necessary for I/O.

The  adder/subtractor  circuit  uses  a  MUX  to  generate  the  sum  and  carry 
outputs,  which  makes  its  operation  relatively  slow.  The  circuit  configuration  of 
the  programmable  delay  is  shown  in  figure  51. 


Figure  51:  Programmable  Delay  Circuit 

Superbuffers are used in between the delay stages so that with any number of
delays programmed the feedback loop is driven quickly.

The modules have been topologically arranged to minimize wiring, but there
are long metal wires that connect the outputs to the pads. However, this should
not impose a severe penalty on the speed of operation since we have used
superbuffer drivers.

5.7.5.  Performance  Estimation 

The speed of the butterfly module depends on the speed of the adder, since
it is in the critical path. The adder, which is a conventional Mead and Conway
[9] design, was SPICE simulated and the carry-out and sum propagation delay was
found to be 100 ns. Since there is only one add taking place in phase 1 of the
cycle, this phase can be as short as this delay. So, assuming a 50 ns
nonoverlap, the cycle of the clock can be 300 ns. Therefore the data rate
through the butterfly module is 3.3 MHz. In the programmable delay module and
the feedback loop to the input of the adder, superbuffers are used to avoid long
delays.
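
The arithmetic behind these figures can be written out as follows. The
breakdown assumes, as the 300 ns total implies but the text does not state
explicitly, that each of the two clock phases is allotted the 100 ns adder
delay and is followed by a 50 ns nonoverlap interval; the symbol names are
introduced here for illustration.

```latex
T_{\text{cycle}} = 2\,\bigl(t_{\text{add}} + t_{\text{nonoverlap}}\bigr)
                 = 2\,(100\ \text{ns} + 50\ \text{ns}) = 300\ \text{ns},
\qquad
f_{\text{data}} = \frac{1}{T_{\text{cycle}}} \approx 3.3\ \text{MHz}.
```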

5.7.6.  Testing 

There is no means on chip to test intermediate internal state, although
some sections of the circuit can be operated fairly independently from the rest.
For example, one could select the external shift register and thus have access
to the inputs and outputs of both of the add/subtract chains, although the data
would pass through some other gates and the $\pi/2$ rotator. The internal shift
register can only be loaded and flushed through the adder/subtractors.

6.  THEORETICAL  WORK

6.1.  Minimum  Latency  Transforms

6.1.1.  Justification 

In traditional signal processing, throughput has been the only important
measure of performance. However, in many applications, latency is a very
important issue. It is important to see that the pipeline processors which we
have been discussing are geared only toward high throughput, and that, in fact,
pipelining any computation will increase the throughput at the expense of
increased latency. One example of possible importance to the Army is the
real-time side-lobe cancellation of jamming signals fed into arrays in which
beamforming is accomplished via the FFT. Low latency is achievable both through
the use of structures which have this quality innately, and through the
development of fast circuits which allow all DFT structures to run more quickly.

Since both the butterfly and CORDIC modules make extensive use of adders,
the latency of these adders is an important part of the entire system latency.
The speed of the adders is governed by the propagation of the carry bit during
the addition. Either many pipeline stages must be inserted into the adder
circuit or fast carry circuits must be devised. The insertion of pipeline stages
will provide high bandwidth computation, but at the cost of increased latency in
providing the result. Thus a compromise is generally made. Some pipelining and
some carry propagation within a pipeline stage are employed. As a result, it is
important to use fast carry-lookahead circuits to keep the bandwidth as high as
possible. The basic design problem of fast carry-lookahead circuits is to
realize a fast circuit with a minimum of gates. Further, the speed and the cost
of the circuits depend on the fan-in and fan-out capabilities of the gates used
to implement the circuit. The basic speed of a gate depends on these same
factors in a negative and non-linear way as well. Thus an optimum design must
consider not only the number of gate delays but the interaction of the design
with the gate delay time itself.

6.1.2.  What  is  the  Absolute  Minimum? 

Clearly, it would seem that minimum latency would be achieved through the
use of the original DFT equation, where all $N^2$ products are computed
simultaneously, and these products are summed in groups of $N$ to form the
outputs. The difficulty is that this would require an adder fan-in of $N$ to
compute the sum in one add time, which is generally neither practical nor
available in communication-bound VLSI designs. The common method which overcomes
this difficulty is the use of fan-in trees which add $f$ numbers at a time and
come to the solution of the larger problem after $\log_f N$ add times. Similarly,
one does not have multipliers which can fan out to $N$ adders simultaneously, so
that one is forced to utilize a tree of multiplexers or redundant multipliers to
communicate the products to their destinations. The result of both of these
observations is that one may as well use FFT-like structures of the Cooley-Tukey
or Good type merely due to the limited fan-in and fan-out one has available.
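
As a rough illustration of the fan-in tree argument, the following Python sketch
counts the add times needed to sum N operands with adders of fan-in f by
simulating the tree level by level. The function name add_levels is ours and
does not appear in the report; the count equals the ceiling of log_f N.

```python
def add_levels(n: int, f: int) -> int:
    """Number of add times needed to reduce n operands with fan-in-f adders."""
    if n < 1 or f < 2:
        raise ValueError("need n >= 1 and f >= 2")
    levels = 0
    while n > 1:
        n = -(-n // f)  # ceil(n / f): each adder absorbs up to f operands
        levels += 1
    return levels


if __name__ == "__main__":
    for fan_in in (2, 4, 8):
        print(f"fan-in {fan_in}: {add_levels(1024, fan_in)} add times for N = 1024")
```

Raising the fan-in shortens the tree only logarithmically, which is one way to
see why limited fan-in pushes the design toward FFT-like structures.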

6.1.3.  VLSI  Fan-in  and  Fan-out  Considerations 

Consider the circuits in figure 52. If one finds that the circuit one desires
can be simplified by the use of higher fan-in gates, one would like the gate on the
right half of the figure to have a delay no greater than the delay of the circuit on
the left. In fact, for most logic families, this is true, even for fan-in as large as
12-15. Of course, the use of these gates in a circuit may cause second-order
effects such as increased wire lengths to come into play, reducing the speedup
for very large fan-in. Also, some circuits cannot make effective use of large
fan-in gates. However, we will see that adders can reap great rewards through the
use of gates of fan-in greater than 2.

Figure 52: Fan-in Comparison


6.1.4. Fast Carry Lookahead

Ladner and Fischer [20] developed a method of reducing the computation
time of linear and many nonlinear recurrences from $O(N)$ time to $O(\log_2 N)$
time by transforming the recurrences to binary trees. We will apply these
techniques to fast carry lookahead circuits and extend them to higher fan-in and
fan-out [21, 22].

On  the  left  side  of  figure  53  is  a  recurrence  which  has  been  expressed  in 
terms  of  a  binary  operation  denoted  by  "O". 


Figure  53:  Desired  Transformation 

The idea is to transform this to the tree on the right side of the figure, which
clearly does the computation more quickly. This transformation requires the
operation to satisfy an associativity property, but does not require linearity, for
example. Figure 54 shows circuits for various numbers of inputs, where the
notation $P_k^j(n)$ denotes a circuit of $n$ inputs with fan-in $j$. The subscript
$k$ allows us to index circuits of different cost/performance, so that $k = 0$
implies the highest performance structure, $k = 1$ the next lower performance
structure, and so on. Note that the figure also shows how these circuits can be
built up recursively from circuits of smaller size.
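
The transformation from a serial recurrence to a tree can be sketched in Python
as follows. This is only an illustration under our own assumptions: it uses a
simple divide-and-conquer prefix scheme with fan-in 2 and unbounded fan-out, not
the report's $P_k^j(n)$ circuits, and the names serial_prefix and tree_prefix are
hypothetical. Both functions return the prefix results together with the depth
(number of operator delays on the longest path).

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def serial_prefix(xs: Sequence[T], op: Callable[[T, T], T]) -> Tuple[List[T], int]:
    """Evaluate the recurrence one step at a time; depth grows linearly."""
    out = [xs[0]]
    for x in xs[1:]:
        out.append(op(out[-1], x))
    return out, len(xs) - 1


def tree_prefix(xs: Sequence[T], op: Callable[[T, T], T]) -> Tuple[List[T], int]:
    """Evaluate the same prefixes recursively; depth grows logarithmically."""
    n = len(xs)
    if n == 1:
        return [xs[0]], 0
    half = (n + 1) // 2
    left, depth_left = tree_prefix(xs[:half], op)
    right, depth_right = tree_prefix(xs[half:], op)
    # The last prefix of the left half fans out to every prefix of the right half.
    combined = left + [op(left[-1], r) for r in right]
    return combined, max(depth_left, depth_right) + 1


if __name__ == "__main__":
    letters = list("abcdefgh")
    concat = lambda u, v: u + v  # associative but not commutative
    print(serial_prefix(letters, concat))
    print(tree_prefix(letters, concat))
```

String concatenation is used in the demonstration because it is associative but
neither commutative nor linear, which matches the only property the
transformation requires.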

For  the  general  fan-in  case,  the  circuits  of  figure  55  result.  Again,  it  is  seen 
in  figure  56  that  the  circuits  can  be  built  up  in  a  recursive  fashion. 

Let us now proceed to develop circuits for carry-lookahead. The operation
of a full adder circuit for the $i$th bit can be expressed

$$S_i = A_i \oplus B_i \oplus C_{i-1}, \qquad C_i = A_i B_i + (A_i \oplus B_i)\, C_{i-1}$$

where $\oplus$ means "exclusive or". The recurrence is clearly seen in the expression
for $C_i$ and is generally accepted to be the hard part of the calculation. Since it
is relatively easy to form the $S_i$ once we have calculated the $C_i$, we will
concentrate on speeding up the carry generation part of the circuit. Figure 57 shows
the well known "ripple carry" circuit. Here, the circled part of the circuit is not
in the proper form to immediately apply the preceding ideas, since it does not

Figure 54: Definition of $P_k^j(n)$ Circuits

Figure 58: Development of Carry Operator

Figure 59 shows the construction of this node for higher fan-in. Although
the operator is now in the proper form to use the $P_k^j(n)$ circuits directly, an
additional speedup can be realized by noting that this operator is
"asymmetrical" in the sense that the delay from $g_2$ to $g_0$ is two units while
the delay from $g_1$ to $g_0$ is one unit. Therefore, our constructions should be
modified to take this into account. A diagram of several of these modified
circuits, denoted by $Q_k^j(n)$, is shown in figure 60. It is easily seen that, for
example, the longest path through the circuit $Q_0^2(3)$ is 3 units using the carry
operator above, while the straightforward use of the $P_0^2(3)$ circuit would have
resulted in a path with a 4 unit delay.

Figure 58: Recursive Construction of High Fan-in Circuits
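
The figure defining the carry operator is not reproduced in this text, so the
following Python sketch assumes the standard generate/propagate formulation:
each bit contributes a pair (g, p), and two adjacent groups combine as
(g_hi + p_hi g_lo, p_hi p_lo). Under that assumption the sketch checks that the
operator is associative and that running products under it reproduce the ripple
carries. All function names are ours.

```python
from itertools import product


def carry_op(hi, lo):
    """Combine (generate, propagate) pairs; `hi` is the more significant group."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    # g_lo passes through an AND and then an OR; g_hi passes through the OR only,
    # which is the asymmetry in delay noted in the text.
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def ripple_carries(a: int, b: int, nbits: int):
    """Carry out of each bit position, computed by the serial recurrence."""
    carries, c = [], 0
    for i in range(nbits):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        c = (ai & bi) | ((ai ^ bi) & c)
        carries.append(c)
    return carries


def prefix_carries(a: int, b: int, nbits: int):
    """The same carries, obtained as running products under carry_op."""
    carries, acc = [], (0, 1)  # identity: generates nothing, propagates everything
    for i in range(nbits):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        acc = carry_op((ai & bi, ai ^ bi), acc)
        carries.append(acc[0])  # with a zero carry-in, the carry is the group generate
    return carries


def is_associative() -> bool:
    pairs = list(product((0, 1), repeat=2))
    return all(
        carry_op(carry_op(x, y), z) == carry_op(x, carry_op(y, z))
        for x in pairs for y in pairs for z in pairs
    )


if __name__ == "__main__":
    print(is_associative())
    print(ripple_carries(0b1011, 0b0110, 4), prefix_carries(0b1011, 0b0110, 4))
```

Because the operator is associative, the running products may be regrouped into
any tree, which is what allows the lookahead circuits to replace the ripple
chain.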

For the general fan-in case, figure 61 shows the construction of the highest
performance $Q_0^j(N_m)$ circuits. Lower performance circuits can be realized as in
figures 62, 63, and 64. An example for fan-in 3 is shown in figure 65.

The total delay of our carry lookahead circuit is thus

$$\text{delay} = T(m, j, k) = \begin{cases} 0 & \text{if } m = 0, 1 \\ m + k & \text{otherwise} \end{cases}$$

and the size of the highest performance ($k = 0$) circuit (number of bits wide) is

$$\text{size} = N(m, j) = N(m-1, j) + (j-1)\, N(m-2, j); \qquad N(0, j) = N(1, j) = 1.$$

Table  5  shows  the  various  sizes  which  are  available  for  a  given  fan-in  and  delay. 
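
As an illustration of how such a table of sizes can be generated, the sketch
below evaluates the delay and size expressions in Python. It assumes our reading
of the size recurrence, N(m, j) = N(m-1, j) + (j-1) N(m-2, j) with
N(0, j) = N(1, j) = 1, which is consistent with the 3-bit, 3-unit fan-in 2
example given earlier; the printed values are computed from that assumption and
are not copied from Table 5.

```python
from functools import lru_cache


def delay(m: int, k: int = 0) -> int:
    """Delay T(m, j, k) in gate units; it does not depend on the fan-in j."""
    return 0 if m in (0, 1) else m + k


@lru_cache(maxsize=None)
def size(m: int, j: int) -> int:
    """Width in bits of the highest performance (k = 0) circuit, as assumed above."""
    if m < 0 or j < 2:
        raise ValueError("need m >= 0 and j >= 2")
    if m in (0, 1):
        return 1
    return size(m - 1, j) + (j - 1) * size(m - 2, j)


if __name__ == "__main__":
    for j in (2, 3, 4):
        row = [size(m, j) for m in range(2, 11)]
        print(f"fan-in {j}: sizes for delays 2..10 -> {row}")
```

For fan-in 2 the sizes follow the Fibonacci sequence, so the width grows
exponentially with the delay; larger fan-in makes the growth steeper.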
Figure 62: General Construction of $Q_k^j(N_m)$

Figure 63: Construction of $P(N_m)$

The cost of the high performance circuits is given by the construction of $Q_0^j$,
which yields

cosf  =  Si(m)  =  Si  (m  -l)+0  ~2)5{  (m  -2)+ 5^  (m  -2)+(j  +  l)(ATm-i+(j  -Z)Nm  -2) 


5l{(m)  =  Ul+^+SUim) 

«&> = o-jy*4? 


f  ^+1  ifm  i 
~\  0  else 


is  even 


slU)*  tt-rljlttfl. 

flf  *  =  >+l  poles 

if  =  *f*i  =  54(i)  gales 

£^*£4-i+0-i)c4-e 
KU-i -*-0-DMU-* 

where  ^jj,  is  the  cost  of  the  7f,  and  if,  is  the  cost  of  the 
Figure 64: Construction of R circuits



Figure  65:  Example  for  Fan-in  3 


If the circuit is the top level circuit, rather than one down inside the recursion,
one saves $N_m - 1$ gates due to the unused "p" outputs. Also, the cost can be
reduced due to the fact that the most significant bits in our present circuit have
delay equal to $m$, which is faster than the other bits. Thus, it is advisable to use
special circuitry here which is slower and cheaper. Table 6 shows a comparison
between these adders for a given circuit size and published results by Ladner
and Fischer [20] and Kuck [23], with a ripple carry circuit's cost included for
comparison. Observe that these new circuits are not only faster, but in some
cases cheaper than the others.


Table  6:  Cost  Comparison  of  Adder  Circuits 

Figure 66 shows a fast adder circuit which would utilize the carry lookahead
circuits just developed.

Figure 66: Fast Adder Circuit


It  is  easy  to  see  that  the  wire  lengths  as  well  as  the  total  add  delay  are  constants 
and  do  not  depend  on  the  word  size.  The  circuit  is  of  course  more  expensive 
than  a  carry  lookahead  circuit  due  to  the  fact  that  2  adders  per  bit  are  required 
and  much  area  will  be  taken  up  by  wiring  due  to  the  redundant  representation. 


However,  the  circuit  clearly  has  the  lowest  latency  of  any  we  have  seen  so  far 
and  lends  itself  quite  well  to  pipelined  processors  where  many  adds  on  the  same 
data are performed, as in a CORDIC rotator. It is also shown that the two adders
which are required could be collapsed into a single 7x2 PLA.

6.1.6. Parallel Versus Cascade Structures

The  latency  of  an  FFT  processor  can  be  considered  to  have  two  parts.  First, 
there  will  be  a  certain  latency  due  to  the  delays  of  the  gates  and  the  adder  cir¬ 
cuits  which  we  have  already  addressed.  Secondly,  there  will  be  an  additional 
latency  which  is  a  function  of  the  degree  of  parallelism  in  the  processor.  In  the 
Despain  Cascade,  for  example,  there  is  a  fundamental  lower  bound  on  the 
latency  proportional  to  the  size  of  the  transform  which  is  due  to  the  fact  that 
one  must  wait  for  all  the  outputs  to  stream  out  of  the  processor  serially. 
Clearly, this difficulty can be side-stepped through the use of hybrid
parallel-pipeline structures such as those which were presented previously. If cost
were no object, one could achieve latencies which were dominated by the delay
through the $\log_r N$ stages of radix-$r$ DFT modules and CORDIC rotators. However,
in most cases, the cost is multiplied by the same factor as the speedup, and, in
the VLSI case, the cost can go up even faster than this, as we have seen before.


6.2. A Short Survey of Fourier Transform Circuits

In  an  attempt  to  broaden  our  study  of  Fourier  Transform  circuitry,  we 
briefly  characterized  as  many  different  designs  as  possible.  Our  results  are 
summarized  in  Table  7.  Each  line  in  the  table  corresponds  to  one  of  the  circuits 
described  later  in  this  section.  All  the  performance  figures  are  reported  in 
asymptotic  form,  in  the  sense  described  below. 

A design is said to occupy area $N$ if there exists some constant $c$ for which
the circuit (built according to that design) solving an $N$-element Fourier
transform occupies no more than $cN$ square wire-widths of VLSI area. If the
constant $c$ happened to be 10000, this means that a 100-element Fourier
transform could be solved on a chip that was $\sqrt{10000 \cdot 100} = 1000$
wire-widths on a side.

Similarly, a design with area performance $N^2$ defines a family of circuits
with the property that a circuit solving an $N$-element transform occupies no
more than $cN^2$ square wire-widths. In this case, doubling the size of the
transform results in a circuit with approximately four times the area. The word
"approximately" is crucial here, since equality may only be observed in the limit
of infinitely large $N$. For small values of $N$, asymptotic performance figures can
be misleading: an area performance of $N^2$ is assigned to a circuit occupying
$c_1 N^2 + c_2 N$ area, even if $c_1$ is much smaller than $c_2$. Nonetheless, the
asymptotic area performance of a design is usually a good indication of relative
circuit size.
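
The two numerical points above can be checked with a few lines of Python. The
constants c1 = 1 and c2 = 1000 below are hypothetical values chosen only to show
the effect; they are not taken from any design in this report.

```python
import math


def chip_side(c: float, n: int) -> float:
    """Side length, in wire-widths, of a square chip of area c * n."""
    return math.sqrt(c * n)


def quadratic_share(c1: float, c2: float, n: int) -> float:
    """Fraction of the area c1*n^2 + c2*n contributed by the quadratic term."""
    quadratic = c1 * n * n
    return quadratic / (quadratic + c2 * n)


if __name__ == "__main__":
    print("side for c = 10000, N = 100:", chip_side(10000, 100))
    for n in (10, 1000, 100000):
        share = quadratic_share(1.0, 1000.0, n)
        print(f"N = {n}: quadratic term is {share:.1%} of the area")
```

With these constants the linear term dominates until N is well past 1000, even
though the design would be classified as having area performance N^2.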

The "Time" figure for a design is also defined in asymptotic terms. A time
performance of $N$ is assigned to a design if its $N$-element circuit can solve one
Fourier transform every $cN$ clock cycles. The length of a clock cycle is
independent of the value of $N$. This rules out the use of gates whose fanout grows
with transform length, unless the output of the gate is fed into an amplifier whose
design (and delay) varies with $N$. Similar considerations lead to a rather
restrictive set of rules for constructing designs. A complete list and explanation
of these rules is contained in [24].

A "Delay" column is included for each design, for it may define pipelined
circuits. If so, the "Delay" figure is larger than the "Time" figure. The former
refers to the latency of the circuit; the latter refers to the number of clock
cycles that separate successive transformations.

The column marked Area*Time^2 indicates whether or not the design is an
optimal one. The last four have the best possible Area*Time^2 figures. Any
modification leading to a smaller area figure must increase the time figure, for
no circuit can have an Area*Time^2 performance better than $N^2 \log^2 N$ [24].


Design            Area               Time             Area*Time^2        Delay

1-cell DFT        $N \log N$         $N^2 \log N$     $N^5 \log^3 N$     $N^2 \log N$
N-cell DFT        $N \log N$         $N \log N$       $N^3 \log^3 N$     $N^2 \log N$
N^2-cell DFT      $N^2 \log N$       $\log N$         $N^2 \log^3 N$     $N^2 \log N$
1-proc FFT        $N \log N$         $N \log^2 N$     $N^3 \log^5 N$     $N \log^2 N$
Cascade           $N \log N$         $N \log N$       $N^3 \log^3 N$     $N \log^2 N$
FFT Network       $N^2$              $\log N$         $N^2 \log^2 N$     $\log^2 N$
Perfect Shuffle   $N^2/\log^2 N$     $\log^2 N$       $N^2 \log^2 N$     $\log^2 N$
CCC               $N^2/\log^2 N$     $\log^2 N$       $N^2 \log^2 N$     $\log^2 N$
Mesh              $N \log^2 N$       $\sqrt{N}$       $N^2 \log^2 N$     $\sqrt{N}$

Table 7: Area-time performance of the Fourier transform-solving circuits.
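
The Area*Time^2 column can be recomputed from the Area and Time columns. In the
Python sketch below each entry is stored as a pair of exponents (power of N,
power of log N), as we read them from Table 7; the representation and all names
are ours. Squaring the time doubles its exponents, and multiplying by the area
adds them.

```python
from fractions import Fraction

# design -> (area exponents, time exponents); (a, b) stands for N^a log^b N
DESIGNS = {
    "1-cell DFT":      ((1, 1), (2, 1)),
    "N-cell DFT":      ((1, 1), (1, 1)),
    "N^2-cell DFT":    ((2, 1), (0, 1)),
    "1-proc FFT":      ((1, 1), (1, 2)),
    "Cascade":         ((1, 1), (1, 1)),
    "FFT Network":     ((2, 0), (0, 1)),
    "Perfect Shuffle": ((2, -2), (0, 2)),
    "CCC":             ((2, -2), (0, 2)),
    "Mesh":            ((1, 2), (Fraction(1, 2), 0)),  # time sqrt(N) = N^(1/2)
}

OPTIMAL = (2, 2)  # the limiting figure N^2 log^2 N


def area_time2(area, time):
    """Exponents of Area * Time^2."""
    return (area[0] + 2 * time[0], area[1] + 2 * time[1])


if __name__ == "__main__":
    for name, (area, time) in DESIGNS.items():
        n_pow, log_pow = area_time2(area, time)
        tag = "optimal" if (n_pow, log_pow) == OPTIMAL else ""
        print(f"{name:16s} N^{n_pow} log^{log_pow} N  {tag}")
```

Only the last four designs reach the exponents (2, 2), in agreement with the
statement that they have the best possible Area*Time^2 figures.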

When delay figures are taken into consideration, only the last three designs
are seen to be optimal. The Perfect Shuffle, the CCC and the Mesh are the only
designs that achieve the limiting Area*Delay^2 product of $\Omega(N^2 \log^2 N)$.
These designs keep all their multiply-add cells and wires busy solving Fourier
transforms using the efficient FFT algorithm. All the others, save one, use too
few processors* or an inefficient algorithm. The FFT network is an interesting
exception to this observation. Its delay inefficiency seems to be a result of its
slow bit-serial multipliers. If fast parallel multipliers were employed, the delay
in each stage of the FFT network might be as low as $O(\log\log N)$. This would not
increase its total area significantly, since its area is still dominated by its
"butterfly" wiring. The improved FFT network could thus have an area*time^2
product of as little as $O(N^2 \log^2 N \,(\log\log N)^2)$.

As indicated above, asymptotic figures can hide significant differences
among supposedly optimal designs due to "constant factors". The area and time
estimates employed in this study are not sensitive to the relative complexity of
the various control circuits required in different designs. For example, the
$N^2$-cell DFT, the Cascade, the FFT Network and the Perfect Shuffle are especially
attractive designs because they have no complicated routing steps. They are
thus given a more detailed examination below.

As indicated in Table 7, the $N^2$-cell DFT is nearly optimal in its area*time^2
performance. However, it is by far the largest design since it uses more than $N^2$
multiply-add cells. (The others use $O(N \log N)$ or fewer cells.) Using current
technology, one might place 10 multiply-add cells on a chip [7]. This means
that one hundred thousand chips would be needed for a thousand-element FFT!
Thus the $N^2$-cell DFT design cannot be considered feasible until technology
improves to the point that 100 or 1000 cells can be formed on a single wafer.
Even then, the interconnections between chips will pose some difficulties, for
there are 40 cells on the "edge" of a 100-cell chip.


* The word "processor" refers to a stored-program computer. There may of course be many
such processors in a single Fourier transform circuit. This usage of "processor" should not be
confused with the "FFT processor" in the title of this report. An FFT processor is a complete
Fourier transform circuit. In an attempt to avoid further confusion between "processors" and
"FFT processors", this section always refers to the latter as "Fourier transform circuits".

The $N$-cell DFT is an attractive design at present, despite its non-optimal
area*time^2 performance. It uses only $2N$ cells in a linear array, so that a
thousand-element Fourier transform can be implemented with only $10^2$ chips of
10 multiply-add cells each. This design is of course much slower than the
$N^2$-cell DFT, since it produces only one element of a transform at a time rather
than an entire transform.

The FFT Network is also fairly attractive at present, for its $(N/2) \log N$
cells can be formed on about the same number of chips as the $N$-cell DFT, yet
its performance is equal to that of the $N^2$-cell DFT. The drawback of the FFT
Network is that the wiring on and between the chips is very area-consuming. It also
has very long intercell wires, whereas the DFT designs use only nearest-neighbor
connections.

The constant factors involved in the Perfect Shuffle design are very similar
to those in the FFT Network discussed above. The Perfect Shuffle uses a factor of
$\log N$ fewer cells than the FFT Network, so it is a bit smaller and slower.
However, it suffers from the same problem of long inter-chip wires and poor
partitionability.

The Cascade is another non-optimal design, like the $N$-cell DFT, that
deserves consideration because of its good "constant factors." It uses only
$\log N$ multiply-add cells and $N$ words of shift-register memory. These are
arranged in a simple linear fashion. The Cascade achieves the same performance as
the $N$-cell DFT, producing one element of a Fourier transform during each
multiply-add time. It is superior to the $N$-cell DFT in that it uses many fewer
multiply-add cells.
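
The cell and chip counts quoted in the preceding paragraphs can be tabulated
with a short Python sketch. The cell counts are the leading terms given in the
text; taking logarithms to base 2 and rounding chip counts up are our
assumptions, as are the function names.

```python
import math

CELLS_PER_CHIP = 10  # figure quoted in the text for current technology


def cell_counts(n: int) -> dict:
    """Approximate multiply-add cell counts for an n-element transform."""
    log_n = math.log2(n)  # base 2 is an assumption; the text writes log N
    return {
        "N^2-cell DFT": n * n,
        "N-cell DFT": 2 * n,
        "FFT Network": (n / 2) * log_n,
        "Perfect Shuffle": n / 2,  # a factor of log N fewer than the FFT Network
        "Cascade": log_n,
    }


def chips_needed(cells: float) -> int:
    """Number of chips when each chip holds CELLS_PER_CHIP cells."""
    return math.ceil(cells / CELLS_PER_CHIP)


if __name__ == "__main__":
    for design, cells in cell_counts(1000).items():
        print(f"{design:16s} {cells:12.0f} cells  {chips_needed(cells):7d} chips")
```

For N = 1000 this gives one hundred thousand chips for the N^2-cell DFT and a
few hundred for the N-cell DFT and the FFT Network, which matches the orders of
magnitude quoted above.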

6.2.1.  Building  blocks 

All  of  the  Fourier  transform  circuits  described  in  the  next  nine  subsections 
are  built  from  a  few  basic  building  blocks:  shift  registers,  multiply-add  cells, 
random-access  memories,  and  processors.  These  are  described  below. 

A $k$-bit shift register can be built from a string of $k$ logic nodes in $O(k)$
area. Each of the logic nodes stores one bit. Shift registers are used to store the
values of variables and constants; these values may be accessed in bit-serial
fashion, one bit per time unit.

Multiply-add cells are used to perform the arithmetic operations in a
Fourier transform. Each cell has three bit-serial inputs $\omega^k$, $x_0$ and $x_1$.
It produces two bit-serial outputs

$$y_0 = x_0 + \omega^k x_1 \quad \text{and} \quad y_1 = x_0 - \omega^k x_1. \qquad (1)$$

The inputs and the outputs are all $\Theta(\log N)$-bit integers.
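
The arithmetic of a multiply-add cell can be sketched in Python as below. This
is a functional model only: it uses Python's built-in complex numbers in place
of the bit-serial integer arithmetic described in the text, and the function
name multiply_add is ours.

```python
def multiply_add(w: complex, x0: complex, x1: complex):
    """Return (y0, y1) = (x0 + w*x1, x0 - w*x1), sharing one multiplication."""
    t = w * x1
    return x0 + t, x0 - t


if __name__ == "__main__":
    y0, y1 = multiply_add(1j, 1 + 0j, 1 + 0j)
    print(y0, y1)
    # The two outputs always sum to twice the first data input.
    print(y0 + y1)
```

Only one multiplication is needed per cell because both outputs use the same
product.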

It is fairly easy to see that a simple (if slow) multiply-add cell can be built
from $O(\log N)$ logic gates [7]. The multiplication is performed by $O(\log N)$
steps of addition in a carry-save adder. The subsequent addition and subtraction
can also be done in $O(\log N)$ time. Thus a complete multiply-add computation
can be done in $O(\log N)$ time and $O(\log N)$ area.

The  aspect  ratios  of  the  multiply-add  cell  and  shift  register  may  be  adjusted 
at  will.  They  should  be  designed  as  a  rectangle  of  0(1)  width  that  can  be  folded 
into  any  rectangular  shape. 

An $S$-bit random-access memory with a cycle time of $O(\log S)$ can be built
in $O(S)$ area, using the techniques of Mead and Rem [25]. (Their area and time
analyses are essentially consistent with the model used here; see [7] for a
comparative study of the two models.) The cycle time claimed above is the best
possible, given the logarithmic delay Assumption 3c, since most of the storage
locations are at least $\sqrt{S}$ wire-widths from the output port of the memory. To
achieve this optimal cycle time, the number of levels in Mead and Rem's
hierarchical memory must grow proportionally with $\log S$.

Processors are used to generate control signals, whenever these become
complex. Each processor is a simple von Neumann computer equipped with an
$O(\log N)$-bit wide ALU, $O(\log N)$ registers, and a control store with
$O(\log N)$ instructions. The cycle time of a processor is $O(\log N)$ time units.
This is enough time to fetch and execute a register-to-register move, a conditional
branch, an "add", or even a "multiply" instruction. It is also enough time to allow
the processor's operands to come from an $N$-bit random-access memory.

At least $O(\log^2 N)$ units of area are required to implement a processor,
since it has $O(\log N)$ words = $O(\log^2 N)$ bits of storage. A straightforward, if
tedious, argument can be made to show that $O(\log^2 N)$ area is actually sufficient
to build a processor [7]. Neither the ALU, the data paths, nor the instruction
decoding circuitry will occupy more room (asymptotically) than the control
store.

6.2.2. The Direct Fourier Transform on One Multiply-Add Cell

The naive or "direct" algorithm for computing the Fourier transform is to
compute all terms in the matrix-vector product of Assumption 5d. Following
this scheme, a total of $N^2$ multiplications are required when an $N$-element input
vector $x$ is multiplied by an $N$-by-$N$ matrix of constants $A$, to yield an
$N$-element output vector $y$. Three degrees of parallelism immediately suggest
themselves: the product may be calculated on one multiply-add cell, on $N$
multiply-add cells, or on $N^2$ multiply-add cells. Each possibility is discussed
separately in the discussion that follows.

A single multiply-add cell will take $O(N^2 \log N)$ time to perform all the
calculations required in the direct Fourier transform algorithm. (Recall that a
multiply-add calculation takes $O(\log N)$ time.) To this must be added the
overhead of calculating the constants in the matrix $A$, since a prohibitively large
amount of area would be required to store these explicitly. Fortunately, this
calculation is quite simple. The constant required during the $ij$-th multiply-add
step (see statement 4 of Figure 69) can generally be obtained by multiplying
$\omega^i$ by the constant used in the previous multiply-add step, $\omega^{i(j-1)}$.
A single processor is capable of performing this calculation, supplying the
necessary constants to the multiply-add cell as rapidly as they are needed. The time
performance of the uniprocessor DFT design is thus $O(N^2 \log N)$.

1.  FOR i ← 0 TO N-1 DO
2.      y_i ← 0;
3.      FOR j ← 0 TO N-1 DO
4.          y_i ← y_i + ω^(ij) x_j;
5.      OD;
6.  OD.

Figure 66: The naive or "direct" Fourier transform algorithm.
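
A runnable Python counterpart of the direct algorithm is sketched below. It
generates each constant from the previous one by a single multiplication, in the
manner described for the uniprocessor design. The choice of
ω = exp(-2πi/N) as the principal root of unity and the use of floating-point
complex arithmetic are our assumptions; the function name direct_dft is ours.

```python
import cmath


def direct_dft(x):
    """Direct N^2-multiplication Fourier transform with running constants."""
    n = len(x)
    y = []
    for i in range(n):
        w_i = cmath.exp(-2j * cmath.pi * i / n)  # omega^i, the per-row multiplier
        a = 1 + 0j                               # constant for j = 0 is omega^0
        acc = 0j
        for j in range(n):
            acc += a * x[j]                      # statement 4 of the figure
            a *= w_i                             # next constant from the previous one
        y.append(acc)
    return y


if __name__ == "__main__":
    for value in direct_dft([1, 1, 1, 1]):
        print(f"{value.real:+.3f} {value.imag:+.3f}j")
```

In exact integer arithmetic the running product introduces no error; in floating
point it accumulates a small rounding error that grows with N.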

The area required by the single multiply-add cell design is $O(\log N)$ for the
multiply-add cell, $O(\log^2 N)$ for the processor supplying the constants, and
$O(N \log N)$ for the random-access memory containing the input and output
registers. This last contribution clearly dominates the others, giving the
uniprocessor DFT design a total area of $O(N \log N)$. Its combined area*time^2
performance is thus a dismal $O(N^5 \log^3 N)$. It has far too little parallelism
for its area. The designs in the next two subsections employ progressively more
parallelism to achieve better performance figures.

6.2.3.  The  Direct  Fourier  Transform  on  N  Cells 

Kung  and  Leiserson  [25]  were  apparently  the  first  to  suggest  that  the 
Fourier transform could be computed by the "direct" algorithm on 2N-1
multiply-add  cells  connected  in  a  linear  array.  These  cells  operate  with  a  50% 
duty  cycle:  the  even-numbered  cells  and  the  odd-numbered  cells  alternately 
perform  the  computational  step  described  below.  An  obvious  optimization  [25] 
results  in  a  circuit  using  only  N  multiply-add  cells  to  accumulate  the  terms  in 
the  DFT. 

The entire DFT calculation is complete in $4N-3$ computational steps. During
each step in which it is active, each even- (or odd-) numbered cell computes
$y' \leftarrow y + ax$ using the value $y$ provided by its right-hand neighbor (the
leftmost cell always uses $y = 0$). The $y'$ values eventually emerging from the
leftmost cell are the outputs $y$ in natural order. The inputs $x$ to the circuit
enter through the leftmost cell and are passed, unchanged, down the line of cells.
Due to the 50% duty cycle of the cells, one $y'$ value is produced (and one $x$
value is consumed) every other computational step.

The  only  complicated  part  of  the  circuit  has  to  do  with  computing  the  con¬ 
stant  values  a.  A  complete  description  of  this  computation  is  rather  lengthy 
[25];  only  a  sketch  is  attempted  here.  Suffice  it  to  say  that  each  a  value  is 
obtained  by  a  single  multiplication  from  the  a  value  previously  used  by  the  cell 
next  closest  to  the  center  of  the  line.  The  only  exception  to  this  rule  is  that  the 
constant-generating  circuitry  for  the  centermost  cell  must  perform  four  multi¬ 
plications  to  obtain  the  next  a  value.  (Perhaps  a  fast  multiplier  might  be  pro¬ 
vided  for  the  centermost  cell,  to  keep  it  from  slowing  down  the  whole  array.)  In 
any event, the constant-generating circuitry for each cell performs a fixed
sequence of register-register operations, all of which can be completed in
$O(\log N)$ time and $O(\log N)$ area.

The time performance of the $N$-cell DFT design is $O(N \log N)$, since each of
the $4N-3$ computational steps can be completed in $O(\log N)$ time. The total
area of the $N$ cells and their constant-generating circuitry is $O(N \log N)$.

Note  that  the  total  area  of  the  N-cell  DFT  design  is  asymptotically  identical 
to  that  of  the  1-cell  design.  This  is  a  reflection  of  the  fact  that  a  register  takes 
the  same  amount  of  room  (to  within  a  constant  factor)  as  a  multiply-add  cell. 
However, one can confidently expect that an actual implementation of the 1-cell
design will be significantly smaller than an $N$-cell design due to this "constant
factor difference."

The area*time^2 performance of the $N$-cell DFT design is $O(N^3 \log^3 N)$. This
is far from optimal, but it is a great improvement on the 1-cell design. The next
subsection describes an $N^2$-cell design that has a nearly optimal area*time^2
performance figure.

6.2.4. The Direct Fourier Transform on $N^2$ Cells

One way of boosting the efficiency of the $N$-cell DFT design is to pipeline its
computation. Instead of circulating intermediate values among one row of $2N-1$
cells for $4N-3$ steps, one can "unroll" the computation onto $4N-3$ rows of $2N-1$
cells. Now each problem instance spends just one computational step on each
row of cells before moving on to the next row. (Note that there are actually
about $8N^2$ cells in the "$N^2$-cell" design.)

All I/O occurs through the leftmost cell in the odd-numbered rows, in the
staggered  order  shown  in  Figure  69.  This  figure  shows  only  the  I/O  for  a  single 
problem  instance;  inputs  for  successive  problem  instances  may  follow  immedi¬ 
ately  behind  the  analogous  inputs  for  the  previous  problem,  after  a  delay  of  one 
computational  step. 


More precisely, the first input for each problem instance enters the
leftmost cell of the first row. The second input enters the leftmost cell of the
third row, two computational steps later (remember that each computational step, as
defined in the previous subsection, involves only "even" or "odd" cells). The
$N$-th input enters the leftmost cell of the $(2N-1)$-th row, $2N-2$ computational
steps after the first input entered the circuit. At the end of this step, the first
output is available from this same cell. The second output comes from the leftmost
cell of the $(2N+1)$-th row, after two more steps, and finally the $N$-th output
emerges from the leftmost cell of the $(4N-3)$-th row, $4N-3$ computational steps
after the first input was injected into the circuit.
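
The staggered I/O pattern just described can be tabulated with a small Python
sketch. It encodes only the row numbers and step counts stated in the paragraph
above; the function name and the dictionary keys are ours.

```python
def io_schedule(n: int):
    """Rows and step counts for one problem instance in the unrolled array."""
    if n < 1:
        raise ValueError("need n >= 1")
    inputs = [
        # The k-th input enters row 2k-1, 2(k-1) steps after the first input.
        {"k": k, "row": 2 * k - 1, "steps_after_first_input": 2 * (k - 1)}
        for k in range(1, n + 1)
    ]
    outputs = [
        # The k-th output leaves row 2N-1+2(k-1) after the same number of steps.
        {"k": k, "row": 2 * n - 1 + 2 * (k - 1),
         "steps_after_first_input": 2 * n - 1 + 2 * (k - 1)}
        for k in range(1, n + 1)
    ]
    return inputs, outputs


if __name__ == "__main__":
    ins, outs = io_schedule(4)
    for entry in ins:
        print("input ", entry)
    for entry in outs:
        print("output", entry)
```

For N = 4 the last input enters row 7 and the last output leaves row 13, that
is, rows 2N-1 and 4N-3.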

As noted above, the $k$-th input for another problem instance can follow
immediately behind the $k$-th input for the previous problem, delayed by only
one computational step. The circuit thus operates in pipelined time
$T = O(\log N)$. The total area of the $N^2$-cell design is $A = O(N^2 \log N)$,
since each cell occupies $O(\log N)$ area. The combined area*time^2 performance of
the design is only a factor of $O(\log N)$ from the optimal figure of
$\Omega(N^2 \log^2 N)$. Thus it is pointless to look for a smaller circuit with a
similar pipelined time performance. However, it is possible to make great
improvements on this circuit's solution delay, as shown by the $(N \log N)$-cell
FFT design presented in a later subsection.

It is fairly easy to describe a few "constant factor" improvements to the
$N^2$-cell DFT design. First of all, at least half of the cells on each row are
idle, due to the 50% duty cycle inherent in the Kung-Leiserson approach. Secondly,
the computations done in the shaded portion of Figure 2 are irrelevant (the
resulting $y'$ values do not affect the circuit's outputs). Each of these
considerations halves the number of required multiply-add cells, leaving fewer than
$2N^2$ cells in an optimized design. Finally, the constant-generating circuitry
described for the $N$-cell design need not be carried over to the $N^2$-cell design,
for each cell uses the same $a$ value every time it does a computational step. In
other words, the constant matrix $A$ can be "hard-wired" into the registers of the
multiply-add cells, so there is no need to circulate the constant matrix $A$ among
the multiply-add cells.

6.2.5. The Fast Fourier Transform on One Processor

Up  to  now,  all  the  circuits  have  computed  the  Fourier  transform  by  the 
naive or direct algorithm. Great increases in efficiency are observed in
conventional uniprocessors using the fast Fourier transform algorithm; it would be
remarkable  indeed  if  we  could  not  take  advantage  of  our  knowledge  of  the  FFT  in 
the  design  of  Fourier  transform  circuits. 

There are a number of versions of the FFT in the literature, differing chiefly
in the order in which they use inputs, outputs, and constants. Figure 70 shows a
"decimation in time" algorithm, taken from Figure 5 of [26]. Figure 71 shows a
"decimation in frequency" algorithm, adapted from Figure 10 of [26]. In both
cases, the N problem inputs are stored in x_i, the N problem outputs are y_i,
and ω is a principal N-th root of unity.

1.   FOR b ← (log N) - 1 TO 0 BY -1 DO
2.       p ← 2^b;  q ← N/p;  /* note that N = pq */
3.       z ← ω^p;  /* z is a principal q-th root of unity */
4.       FOR i ← 0 TO N-1 DO
5.           j ← i mod q;  k ← reverse(i);
6.           IF (k mod p) = (k mod 2p) THEN
7.               <x_k, x_{k+p}> ← <x_k + z^j x_{k+p}, x_k - z^j x_{k+p}>;
8.           FI;
9.       OD;
10.  OD;
11.  FOR i ← 0 TO N-1 DO  /* unscramble outputs */
12.      y_reverse(i) ← x_i;
13.  OD.

Figure 70: The FFT by "decimation in time." Note: reverse(i) interprets i as
an unsigned (log N)-bit binary integer then outputs that integer with its
bits reversed, i.e., with its most-significant bit in the least-significant
position.
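
The decimation-in-time program of Figure 70 can be transcribed almost line for line into a short script. The sketch below is illustrative only: the names fft_dit and dft are invented for this example, the choice ω = exp(-2πi/N) is an assumed sign convention (the text only requires a principal N-th root of unity), and z is taken to be ω^p so that it is a principal q-th root of unity as the comment in statement 3 requires.

```python
import cmath

def reverse(i, bits):
    """Reverse the low `bits` bits of i, as reverse(i) does in Figure 70."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft_dit(x):
    """Statements 1-13 of Figure 70; len(x) must be a power of two."""
    n = len(x)
    bits = n.bit_length() - 1
    x = list(x)
    omega = cmath.exp(-2j * cmath.pi / n)   # assumed sign convention
    for b in range(bits - 1, -1, -1):        # statement 1
        p = 1 << b
        q = n // p
        z = omega ** p                       # principal q-th root of unity
        for i in range(n):
            j = i % q
            k = reverse(i, bits)
            if k % p == k % (2 * p):         # true when bit b of k is 0
                t = (z ** j) * x[k + p]
                x[k], x[k + p] = x[k] + t, x[k] - t
    y = [0j] * n
    for i in range(n):                       # unscramble outputs
        y[reverse(i, bits)] = x[i]
    return y

def dft(x):
    """Direct O(N^2) transform, used only as a reference."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[m] * w ** (m * k) for m in range(n)) for k in range(n)]
```

Comparing fft_dit against dft on a few small inputs is a quick way to confirm that the constants are consumed in bit-reversed order, which is what the reverse(i) in statement 5 arranges.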

Either Figure 70 or 71 may be used as an algorithm for a uniprocessor that
runs in O(N log N) computational steps. The total area of such a design is
O(N log N), due mostly to input and output storage. (Recall that a single
processor fits in O(log^2 N) area.) Total time for an N-element FFT is
O(N log^2 N), since each computational step takes O(log N) time units. This is,
as expected, a vast improvement over the uniprocessor DFT circuit. However, it
is far from being area*time^2 optimal, for its processor/memory ratio is too
low. Adding more processors, as in the following design, increases the
performance of an FFT circuit.
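
The gap between the one-processor design and the lower bound can be made explicit by a short calculation. The lines below simply multiply out the area and time figures quoted in this subsection, ignoring constant factors; the lower bound Ω(N^2 log^2 N) is the one used throughout this section.

```latex
A \, T^2 = (N \log N)\,(N \log^2 N)^2 = N^3 \log^5 N ,
\qquad
\frac{N^3 \log^5 N}{N^2 \log^2 N} = N \log^3 N .
```

Under these figures the uniprocessor misses the optimal area*time^2 product by a factor that grows as N log^3 N.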
1.   FOR b ← (log N) - 1 TO 0 BY -1 DO
2.       p ← 2^b;  q ← N/p;
3.       z ← ω^(q/2);  /* z is a principal 2p-th root of unity */
4.       FOR i ← 0 TO N-1 DO
5.           j ← i mod p;
6.           IF (i mod 2p) = j THEN
7.               <x_i, x_{i+p}> ← <x_i + x_{i+p}, z^j x_i - z^j x_{i+p}>;
8.           FI;
9.       OD;
10.  OD;
11.  FOR i ← 0 TO N-1 DO  /* unscramble outputs */
12.      y_reverse(i) ← x_i;
13.  OD.

Figure  71:  The  FFT  by  "decimation  in  frequency.” 
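
The decimation-in-frequency program of Figure 71 admits the same kind of transcription. This is a sketch under the same assumptions as before: fft_dif and dft are names made up for the example and ω = exp(-2πi/N) is an assumed sign convention. Statement 7 is written here with a single multiplication, z^j (x_i - x_{i+p}), which is algebraically the same as z^j x_i - z^j x_{i+p}.

```python
import cmath

def reverse(i, bits):
    """Reverse the low `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft_dif(x):
    """Statements 1-13 of Figure 71; len(x) must be a power of two."""
    n = len(x)
    bits = n.bit_length() - 1
    x = list(x)
    omega = cmath.exp(-2j * cmath.pi / n)   # assumed sign convention
    for b in range(bits - 1, -1, -1):
        p = 1 << b
        q = n // p
        z = omega ** (q // 2)                # principal 2p-th root of unity
        for i in range(n):
            j = i % p
            if i % (2 * p) == j:             # true when bit b of i is 0
                a, c = x[i], x[i + p]
                x[i], x[i + p] = a + c, (z ** j) * (a - c)
    y = [0j] * n
    for i in range(n):                       # unscramble outputs
        y[reverse(i, bits)] = x[i]
    return y

def dft(x):
    """Direct O(N^2) transform, used only as a reference."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[m] * w ** (m * k) for m in range(n)) for k in range(n)]
```

Unlike Figure 70, the exponent j here is taken directly from the low-order bits of i, so no bit reversal is needed to select the constant.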


6.2.6. The Cascade Implementation of the Fast Fourier Transform

The Cascade arrangement of log N multiply-add cells [2] was discussed at
great length earlier in this report in Section 2.2. At the risk of seeming
repetitious, we will describe it again using the theoretical notation of this
section.

In a Cascade, one of the outputs of each multiply-add cell is connected to
the input of a shift register of an appropriate length. See Figure 72. The shift
register's output is connected to one of the multiply-add cell's inputs, forming a
feedback loop. The remaining inputs and outputs of the multiply-add cells are
used to connect them into a linear array. Problem inputs (values of x) are fed
into the leftmost cell; problem outputs (values of y) emerge from the rightmost
cell. The decimation in frequency algorithm of Figure 71 is employed, to keep
the cells' computations as simple as possible.


Figure  72:  The  Cascade  arrangement  of  3  multiply-add  cells,  for  computing 
8-element  FFTs.  The  multiply-add  cells  are  square;  the  rectangular  boxes 
each  represent  one  word  of  shift  register  storage. 


Each cell handles the computations associated with a single value of the
loop index b in Figure 71. The leftmost cell performs the loop computations for
b = log N - 1; the rightmost cell performs the loop computations for b = 0. The
pairing of x values indicated in statement 7 of Figure 71 is accomplished by the
2^b-word shift register associated with cell b.

The attentive reader will note that statement 7 is not exactly the same as
the multiply-add step defined in Equation (1). Statement 7 involves one
constant value z^j, two variable values x_i and x_{i+p}, two additions, but two
(instead of one) multiplications. Thus its computation will take about twice as
much time or area as a "standard" multiply-add step.
The conditional test of statement 6 is implemented by having each cell
monitor the b-th bit of the count i of input elements that it has already
processed. The condition of statement 6 is satisfied whenever that bit is 0. In
this case, a cell performs the computation indicated in statement 7. It sends
the new value for x_i to the right, and retains the new value for x_{i+p} in its
shift register. Whenever the b-th bit of i is 1, no multiply-add computations
are performed. However, some data movement is necessary: the data appearing on
the cell's lower input line should be copied into its shift register. Also, the
values emerging from its shift register should be sent on to the next cell on
its right.
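
The behaviour of a Cascade can be explored with a small stream simulation. The following is a minimal sketch, not a description of the actual hardware: the function names, the use of a deque as the shift register, and the exact phase of the input count on which the butterfly fires are all assumptions made for the example. Each stage alternates between pure data movement (store the input, pass on the register output) and the computation of statement 7 (send the sum to the right, retain the twiddled difference).

```python
import cmath
from collections import deque

def cascade_stage(stream, p, z):
    """One multiply-add cell with a p-word shift register (hypothetical model)."""
    fifo = deque([0j] * p)
    out = []
    # p extra words flush the last differences out of the register
    for t, v in enumerate(list(stream) + [0j] * p):
        pos = t % (2 * p)
        head = fifo.popleft()
        if pos < p:
            out.append(head)                 # data movement only
            fifo.append(v)
        else:
            out.append(head + v)             # new x_i goes to the right
            fifo.append((z ** (pos - p)) * (head - v))  # new x_{i+p} is kept
    return out[p:]                           # drop the p-word start-up latency

def cascade(stream, n):
    """Chain log N stages; the stream may hold several n-element problems."""
    bits = n.bit_length() - 1
    omega = cmath.exp(-2j * cmath.pi / n)    # assumed sign convention
    for b in range(bits - 1, -1, -1):
        p = 1 << b
        stream = cascade_stage(stream, p, omega ** (n // (2 * p)))
    return stream                            # outputs in bit-reversed order

def reverse(i, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def dft(x):
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[m] * w ** (m * k) for m in range(n)) for k in range(n)]
```

Feeding two 8-element problems back to back through cascade(stream, 8) produces two bit-reversed transforms in sequence, which illustrates why a new problem instance can enter as soon as the previous one has been loaded.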

One of the advantages of using the decimation in frequency algorithm on
the Cascade is the ease of computing the constants for its multiply-add steps.
Only a few registers and a single multiplier are required to generate the
constants required by each cell. Referring again to the program of Figure 71,
the constant z^j required in statement 7 may be obtained by multiplying the
previously generated constant z^(j-1) by z. If this multiplication is performed
whether or not statement 7 is executed, no conditional transfers are necessary
in the constant-generating circuitry.†
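
A running product is enough to generate the constants for one cell. The sketch below is an illustration with invented names (cell_constants, reset); it multiplies by z on every input count, and optionally resets the product to 1 whenever j = 0 as a guard against accumulated round-off. It assumes z = exp(-2πi/(2p)), one particular principal 2p-th root of unity.

```python
import cmath

def cell_constants(n, b, reset=False):
    """Constants produced at input counts i = 0..n-1 for the cell with index b."""
    p = 1 << b
    z = cmath.exp(-2j * cmath.pi / (2 * p))  # assumed principal 2p-th root
    c = 1 + 0j
    out = []
    for i in range(n):
        if reset and i % p == 0:
            c = 1 + 0j                       # conditional transfer: j = 0
        out.append(c)
        c *= z                               # unconditional multiply by z
    return out
```

Without the reset the register holds z^i, which agrees with z^j at every count whose b-th bit is 0 because z^(2p) = 1.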

As noted above, the constant-generating circuitry for each cell consists of a
multiplier and a few registers. It is thus comparable in area and time
complexity to the multiply-add cell itself. Thus the total area of the Cascade
design is obtained by multiplying the number of cells, log N, by the area per
cell, O(log N). To this must be added the area of the shift registers.
Unfortunately, there is a total of N-1 words of storage in these registers, so
the entire design occupies O(N log N) area. Thus the Cascade, like the
one-processor design, is almost all memory. An entire problem instance must be
stored in the circuit while the Fourier transform is in progress.

The time performance of the Cascade is somewhat improved over the
one-processor design. Input values enter the leftmost processor at the rate of
one per multiply-add step. An entire problem instance is thus loaded in
O(N log N) time units. It is easy to see that the Cascade can start processing a
new problem instance as soon as the previous one has been completely loaded, so
its "pipelined time" performance is T = O(N log N).

One awkward feature of the Cascade is that it produces its output values in
bit-reversed order. Formally, their order is derived from the natural
left-to-right indexing (0 to N-1) by reversing the bits in each index value, so
that the least significant bit is interpreted as the most significant bit. The
last few lines of Figure 71 perform this bit-reversal, but they cannot be
performed on the circuit described thus far. If natural ordering is desired, a
processor should be attached to the output end of the Cascade. If this processor
has N words of RAM storage, a simple algorithm will allow it to reorder the
outputs of the Cascade as rapidly as they are produced.
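
One simple reordering algorithm of this kind writes each arriving value to the RAM address obtained by reversing the bits of its arrival count. The sketch below handles a single problem instance; the names reorder and ram are illustrative, and overlapping consecutive instances (for example by double buffering) is left out.

```python
def reverse(i, bits):
    """Reverse the low `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def reorder(stream, n):
    """Store the t-th arriving value at address reverse(t) in an n-word RAM."""
    bits = n.bit_length() - 1
    ram = [None] * n
    for t, v in enumerate(stream):
        ram[reverse(t, bits)] = v
    return ram
```

Because bit reversal is its own inverse, applying reorder to a bit-reversed stream yields natural order, one write per arriving word.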

6.2.7. The FFT Network

One of the most obvious ways of implementing the FFT in hardware is to
provide one multiply-add cell for each execution of statement 7 in the algorithm
of Figure 70. (The algorithm of Figure 71 might also be used, but, as noted in the
previous subsection, its multiply-add computation is a little more complex.)
Each cell is provided with a register holding its particular value of z^j. Since
statement 7 is executed (N/2) log N times, a total of (N/2) log N multiply-add
cells are required for this "full parallelization" of the FFT.

† Note that z^i = z^j whenever the b-th bit of i is 0, since z is a 2p-th root of
unity. Of course, exact equality obtains only when exact arithmetic is employed.
This is easy to arrange in a number-theoretic transform. When round-off errors
cannot be avoided, for example in a complex-valued transform, it is probably
best to use a conditional transfer to reset z^j to 1 whenever j = 0.

One possible layout for the cells in an FFT network is to have log N rows of
N/2 cells each, as shown in Figure 73. Each row of cells in the FFT network
corresponds to an entire iteration of the "FOR b" loop of the algorithm of
Figure 70. The interconnections between the rows are defined by the way that the
array x is accessed. The reader is invited to check that each multiply-add cell
in Figure 73 corresponds to an execution of statement 7 in Figure 70 for the
case N=8.

Note that the inputs to the FFT network are in "bit-shuffled" order and its
outputs are in "bit-reversed" order. This seems to minimize the amount of area
required for interconnecting the rows. Additional wiring may of course be added
to place inputs and outputs in their natural, left-to-right order.

The interconnections of Figure 73 may be obtained from the following general
scheme. Number the cells naturally: from 0 to N/2-1, from left to right.
Then cell i in the first row is connected to two cells in the second row: cell i
and cell (i + N/4) mod N/2. Cell i in the second row is connected to cells i and
(N/4)⌊i/(N/4)⌋ + ((i + N/8) mod N/4) in the third row. Cell i in the k-th row
(where k = 1, 2, ..., log N - 1) is connected to two cells in the (k+1)-th row:
cell i and cell (N/2^k)⌊i/(N/2^k)⌋ + ((i + N/2^(k+1)) mod N/2^k). Another way
of describing this "butterfly" interconnection pattern is to say that a cell on
the k-th row connects to the two cells on the next row whose indices differ at
most in their k-th most significant bit. (The interconnections between rows in
an FFT network can also be laid out in the "perfect shuffle" pattern described
in the next subsection. However, this seems to lead to a larger layout, if only
by a constant factor.)
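
The two descriptions of the butterfly pattern can be checked against each other mechanically. In the sketch below, partner is a hypothetical helper that computes the non-vertical neighbour of cell i in row k by splitting the row into blocks of N/2^k cells and moving half a block within the cell's own block; the block base is included explicitly, which is an assumption about how the indexing is meant to be read. The result should equal i with its k-th most significant bit flipped.

```python
def partner(i, k, n):
    """Cell in row k+1 joined to cell i of row k by the 'long' wire.

    Cells in a row are numbered 0 .. n/2 - 1, and k runs from 1 to log n - 1.
    """
    block = n >> k            # N / 2^k cells per block
    half = n >> (k + 1)       # N / 2^(k+1), the length of the long wire
    return (i // block) * block + (i + half) % block

def flip_kth_msb(i, k, n):
    """Flip the k-th most significant bit of a (log n - 1)-bit index."""
    return i ^ (n >> (k + 1))
```

For N = 8 the first row pairs cell i with cell (i + 2) mod 4, and the second row pairs cells 0 and 1 and cells 2 and 3.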

A careful study of Figure 73 and the preceding paragraph should convince
the reader that N/2 horizontal tracks are necessary and sufficient for laying
out the interconnections between the first two rows. Essentially, each cell in
the first row has one "long" output wire that must cross the vertical midline of
the diagram. This connection must be assigned a unique horizontal track to cross
the midline. Once this is done, the rest of the wiring for that row is trivial,
especially if the cells are "staggered" slightly as in Figure 73.

The connections between the second and third rows occupy only N/4
horizontal tracks. No wires cross the vertical midline of the diagram, but each
of the N/4 cells on either side of the midline has a fairly long connection that
takes up to half of a horizontal track.

In general, the connections emerging from the k-th row (k = 0, 1, ..., log N - 1)
occupy N/2^(k+1) tracks. Straight vertical wires are used to connect cell i in
the k-th row with cell i in the (k+1)-th row. The horizontal tracks are divided
into 2^k equally-sized pieces, then individually assigned to the "long"
connection from each cell.

Following the scheme outlined above, a total of N-1 horizontal tracks are
required to lay out the inter-row connections. An additional N horizontal tracks
could be added above and below the FFT network to bring its inputs and outputs
into natural order.
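
The total of N-1 tracks is just the sum of the per-row counts N/2, N/4, ..., 1. A few lines suffice to confirm the arithmetic; horizontal_tracks is a name chosen for this illustration, and the rows are indexed from 0 as in the general rule above.

```python
def horizontal_tracks(n):
    """Sum of N / 2^(k+1) over k = 0 .. log N - 1, for n a power of two."""
    rows = n.bit_length() - 1
    return sum(n >> (k + 1) for k in range(rows))
```

For N = 8 the rows contribute 4, 2, and 1 tracks.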

The number of vertical tracks in an FFT network depends strongly upon the
width of the multiply-add cells. If these are set on end, so that each is O(1)
units tall and O(log N) units wide, then the entire network will fit into a
rectangular region that is O(N) units wide and O(N) units tall. The height of
the log N rows of multiply-add cells is asymptotically negligible.

The pipelined time performance of the FFT network is clearly O(log N),
since a new problem instance can enter the network as soon as the previous one
has left the first row of multiply-add cells. The delay imposed by each row's
multiply-add computation and long-wire drivers is O(log N), and there are
O(log N) rows, so the total delay of the network is O(log^2 N).

Figure 73: The FFT network for N=8.

Note that this layout of the FFT network must be optimal, for the circuit
has an optimal area*time^2 performance of O(N^2 log^2 N). Any asymptotic
improvement in the layout area would amount to a disproof of Vuillemin's
optimality result [27].

6.2.8.  The  Perfect-Shuffle  Implementation  of  the  FFT 

Over a decade ago, Stone [26] noted that the "perfect shuffle"
interconnection pattern of N/2 multiply-add cells is perfectly suited for an FFT
computation by decimation in time. Figure 74 shows the perfect shuffle network
for the 8-element FFT, and Figure 70 shows the appropriate version of the FFT
algorithm.


Figure  74:  The  perfect  shuffle  interconnections  for  N=8. 

Each multiply-add cell in a perfect shuffle network is associated with two
input values, x_k and x_{k+1}. Here, k is an even number in the range
0 ≤ k < N-1. A connection is provided from one of the outputs of the cell
containing x_k to one of the inputs of the cell containing x_j if and only if
j = 2k mod (N-1). Note that this mapping of output indices onto input indices is
one-to-one, and that it corresponds to an "end-around left shift" of the
(log N)-bit binary representation of k.
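
The equivalence between the arithmetic rule and the end-around left shift is easy to verify by enumeration. In this sketch, shuffle and rotl are illustrative names; the last index N-1 is treated as a fixed point, an assumption needed because 2k mod (N-1) is only meaningful for k < N-1.

```python
def shuffle(k, n):
    """Destination index of x_k under the perfect shuffle on n elements."""
    return k if k == n - 1 else (2 * k) % (n - 1)

def rotl(k, bits):
    """End-around left shift of a `bits`-bit binary integer."""
    mask = (1 << bits) - 1
    return ((k << 1) | (k >> (bits - 1))) & mask
```

For N = 8 the mapping sends indices 0..7 to 0, 2, 4, 6, 1, 3, 5, 7.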

The computation of the FFT on the perfect shuffle network can now be
described. First, the input values x_k are loaded into their respective
multiply-add cells. Then a multiply-add step is performed: each cell ships its
original x_k values out over its output lines, and computes new x_k values
according to Equation (1). It is not very obvious, but nonetheless it is true,
that this corresponds to an entire iteration of the "FOR b" loop of Figure 70.
For example, the leftmost cell of Figure 73 computes new values for x_0 and x_4,
having received the original value of the former from its own output line and
the original value of the latter from the third cell. This is the computation
required by step 7 of Figure 70 when N=8, b=2, p=4, q=2, i=0, j=0, and k=0.

The FFT computation proceeds in this fashion for log N parallel multiply-add
steps. In each step, the cell containing the (updated) version of x_k ships this
value to the cell formerly containing the (updated) version of
x_{2k mod (N-1)}. Each cell then performs a multiply-add computation, updating
the two data values currently in its possession.
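
The claim that each shuffle-then-compute step realizes one iteration of the outer loop of the decimation-in-time algorithm can be tested with a small simulation. This is a sketch only: the array home, which records which x_k currently sits at each position, and the function names are inventions of the example, and ω = exp(-2πi/N) is an assumed sign convention. The cell occupying positions 2m and 2m+1 looks up k and applies the butterfly of statement 7 with p = 2^b and z = ω^p.

```python
import cmath

def reverse(i, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def shuffle_fft(x):
    """log N shuffle-and-butterfly steps; returns x in bit-reversed output order."""
    n = len(x)
    bits = n.bit_length() - 1
    omega = cmath.exp(-2j * cmath.pi / n)    # assumed sign convention
    data = list(x)
    home = list(range(n))                    # home[pos] = index k held at pos
    for b in range(bits - 1, -1, -1):
        p, q = 1 << b, n >> b
        z = omega ** p
        moved, moved_home = [0j] * n, [0] * n
        for pos in range(n):                 # routing over the shuffle wires
            dst = ((pos << 1) | (pos >> (bits - 1))) & (n - 1)
            moved[dst], moved_home[dst] = data[pos], home[pos]
        data, home = moved, moved_home
        for m in range(n // 2):              # one multiply-add step per cell
            k = home[2 * m]
            assert home[2 * m + 1] == k + p  # the cell holds x_k and x_{k+p}
            t = (z ** (reverse(k, bits) % q)) * data[2 * m + 1]
            data[2 * m], data[2 * m + 1] = data[2 * m] + t, data[2 * m] - t
    assert home == list(range(n))            # values are back in their cells
    return data

def dft(x):
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[m] * w ** (m * k) for m in range(n)) for k in range(n)]
```

After log N steps every value has been rotated back to its starting cell, and entry reverse(k) of the result holds the k-th transform output.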

At the end of log N parallel multiply-add steps, each cell contains the final
versions of its original data values. Unfortunately, the FFT computation of
Figure 70 is not complete. The outputs y are all available among the final x
values, but they appear in "bit-reversed" order. Additional circuitry is
required to bring them into natural order, following steps 11-13 of Figure 70.
The techniques of [29] could be employed in the design of reordering circuitry
that would operate in O(log^2 N) time, without affecting the area performance of
the perfect shuffle network.

The log N parallel multiply-add steps require a total of O(log^2 N) time. The
data movement involved in each multiply-add step does not require any
additional time, at least in an asymptotic sense. As will be seen below, the
"shuffle" connections between cells are implemented as single wires carrying
bit-serial data. Each wire is less than O(N) units long, and each word has
O(log N) bits, so that the data transmission time per step is the same as the
multiplication time, O(log N) time units.

The total area of the perfect shuffle implementation is a bit harder to
estimate. There are N/2 multiply-add cells, each occupying O(log N) area.
However, the best embedding known for the shuffle interconnections takes up
O(N^2/log^2 N) area [30]. It is easy to see that no better embedding is
possible, since otherwise the perfect shuffle circuit would have an impossibly
good area*time^2 performance.

6.2.9.  The  CCC  Network 

The cube-connected-cycles (CCC) interconnection for N cells is capable of
performing an N-element FFT in O(log N) multiply-add steps [31]. Using the
multiply-add cell of the previous constructions, the complete FFT takes
O(log^2 N) time.

The CCC network is very closely related to the FFT network. In fact, a CCC
network is just an FFT network with "end-around" connections between the first
and last rows. For this reason, CCC networks do not exist for all N, only for
those N of the form (K/2)*(log K) for some integer K. Figure 75 illustrates the
CCC network for N=8. It is derived from the 4-element FFT network with "split
cells": each cell handles one element of the input vector x, instead of two as
in the FFT network of Figure 73. (The reader is invited to redraw Figure 75,
combining the cells linked by horizontal data paths. The resulting graph should
be isomorphic to a "butterfly" whose outputs have been fed back into its
inputs.)


Figure 75: The CCC network for N=8.

The CCC network is somewhat smaller than the FFT network, since it uses
only N cells to solve an N-element problem instead of the FFT network's
(N/2)*(log N) cells. Furthermore, the CCC's interconnections can be embedded
in only O(N^2/log^2 N) area [31]. This is an optimal embedding, for the
combined area*time^2 performance is within a constant factor of the limit,
Ω(N^2 log^2 N).

It is rather difficult to describe the data routing pattern during the
computation of a Fourier transform on a CCC, although the basic approach is similar to
that  taken  on  the  perfect  shuffle  network.  Each  of  the  log  N  multiply-add  steps 
is  preceded  and  followed  by  a  routing  step.  These  routing  steps  take  O(log  N) 
time  each,  for  they  move  0(1)  words  over  each  intercellular  connection.  Thus 
the  total  time  spent  in  routing  data  does  not  dominate  the  time  spent  on 
multiply-add  computations. 


6.2.10.  The  Mesh  Implementation 

A square Mesh of N processors is shown in Figure 76. It consists of
approximately √N rows of √N processors each, fitted with word-parallel
interconnections. It is thus essentially the ILLIAC IV architecture, with the
difference that each processor in the Mesh is capable of running its own
program. (A closer approximation to the ILLIAC IV would have N multiply-add
cells, each deriving control signals from a central processor.)
Figure 76: The Mesh of N processors, formed of 2^⌊(log N)/2⌋ rows and
2^⌈(log N)/2⌉ columns.

The total area of the Mesh is O(N log^2 N), since there are N processors each
of O(log^2 N) area. The processors should each be laid out with a square aspect
ratio, so that the O(log N) wires in each word-parallel data path do not add to
the asymptotic area of the layout. Note that it takes O(loglog N) time to send a
word of data from one processor to its neighbor, since the interprocessor wires
are O(log N) in length.

Stevens [32] appears to have been the first to point out that the Mesh can
perform an N-element FFT in log N steps of computation. Each "step" consists
of an entire iteration of the FOR b loop of Figure 70. Each processor in the
Mesh performs the loop computation for one value of the index variable k. The
total amount of data movement during the FFT can be minimized by making an
appropriate assignment of index values k to individual Mesh processors. It turns
out that a fairly good choice is obtained from the natural row-major ordering (0
to N-1) of the Mesh. Processor k is then the "home" of the variable x_k.

(Another,  more  intuitive  way  of  visualizing  the  computation  of  the  FFT  on 
the  Mesh  is  to  view  the  latter  as  a  time-multiplexed  version  of  the  FFT  network. 
During  each  step,  N/2  of  the  Mesh's  processors  take  on  the  role  of  the  N /  2 
cells  in  one  row  of  the  FFT  network.  The  wires  connecting  the  rows  of  the  FFT 
network  are  simulated  by  data  movement  among  the  processors  of  the  Mesh.) 

An iteration of the FOR b loop of Figure 70 can now be described. Each
mesh processor examines the b-th bit of k to decide if it will perform the
computation of statement 7. (For example, when b = 0, p = 1, so that only the
even-numbered processors will perform statement 7.) Next, each processor that
will not perform statement 7 sends its current value of x_k to processor
k - 2^b. (For example, when b = 0, each odd-numbered processor sends its x value
to the processor on its left.) Statement 7 is then executed, and finally the
updated x_k values are returned to their "home" processors.
When b = log N - 1, the data movement required before statement 7 can be
visualized by "sliding" all the x_k values in the bottom half of the Mesh up to
the top half of the Mesh. In this way, processor 0 receives the current value of
x_{N/2}, processor 1 receives the value of x_{N/2+1}, etc. This particular data
movement will be called a "distance-N/2 route." In general, a distance-2^b
route must be performed both before and after each execution of statement 7.

The time required by a distance-2^b route depends, of course, on the value of
b. When b = 0 or b = ⌈(log N)/2⌉, all data movement is between nearest
neighbors (horizontal or vertical) in the Mesh. As mentioned above, this takes
only O(loglog N) time.

When b = ⌈(log N)/2⌉ - 1 or b = log N - 1, it would seem that
O(√N loglog N) time is required for a distance-2^b route. Each data element must
ripple through about √N/2 processors. However, this result may be improved by
using the "high-power" inputs on the long-wire drivers on the interprocessor
data paths (see Assumption Id). Once the bits in a data element have been
amplified enough to be sent to a neighboring processor, only one more stage of
amplification is necessary to send these bits on to the next processor. Since
the amplifier stages in a long-wire driver are individually clocked, all data
elements in a routing operation may "slide" toward their destination
simultaneously, moving by one processor-processor distance every time unit. The
total time taken by a distance-2^b routing is thus easily seen to be
2^(b mod ⌈(log N)/2⌉) + O(loglog N).

The total time taken by all routings in a complete FFT computation is
bounded by O(√N). Essentially, this is the sum of a geometric series whose
largest term is the time taken by the longest routing operation, O(√N). The time
performance of the Mesh design is thus O(√N). At least asymptotically, the
O(log^2 N) time required for the multiply-add computations is insignificant
compared to the time required for the routing operations.
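
The geometric-series argument can be illustrated numerically. The sketch below rests on an assumption about the layout: a row-major Mesh with 2^⌈(log N)/2⌉ columns, so that a distance-2^b route moves each element 2^b columns when b is small and 2^(b - ⌈(log N)/2⌉) rows otherwise. The names route_distance and total_route_steps are invented for the example, and the O(loglog N) start-up cost of each route is ignored.

```python
import math

def route_distance(b, log_n):
    """Processor-to-processor hops for a distance-2^b route (assumed layout)."""
    col_bits = (log_n + 1) // 2              # ceil(log N / 2)
    return 1 << (b % col_bits)

def total_route_steps(log_n):
    """Hops summed over all b, counting the route before and after statement 7."""
    return 2 * sum(route_distance(b, log_n) for b in range(log_n))
```

For N = 1024 the horizontal routes contribute 1 + 2 + 4 + 8 + 16 hops and the vertical routes the same again, so the total stays within a small constant multiple of √N = 32.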

Three aspects of the Mesh implementation deserve further attention. First
of all, the individual processors are expected to come up with their own z^j
values, as they execute statement 7 of Figure 70. This is not difficult to
arrange: each processor has O(log^2 N) bits of program storage, so it can easily
perform a table look-up to obtain the required constants. One constant is needed
for each processor, for each value of b.

Secondly, the algorithm described computes the values in bit-reversed
order (relative to the natural row-major ordering of the Mesh). If the outputs
are desired in natural order, another O(√N) routing operations are required
[29], and the individual processors' programs become a bit more complicated.

One final note: the Mesh implementation, as described, is area*time^2
optimal. A slightly less efficient, but possibly more practical design has been
suggested. Instead of using word-parallel buses between N processors in a
mesh, one might provide bit-serial buses between N cells in a mesh. Now the
best possible time performance is constrained by the bit-serial buses to be no
better than O(√N log N). Similarly, the area could be reduced to as little as
O(N log N). However, it will be a bit tricky to attain these performance
figures. There is not enough area to store each cell's z^j values locally, so
these values must be computed "on the fly" in (hopefully) only a few extra
multiplications. This seems to be impossible to accomplish directly. One
solution to this difficulty is to have the cells exchange z^j values as well as
x_k values. The bit-serial approach is thus inherently slower both in routing
time and in the number of necessary multiplications. On the other hand, the
word-parallel approach has wider buses and perhaps larger look-up tables, so
that it takes up somewhat more area.
7. CONCLUSIONS

We have investigated the problem of implementing Discrete Fourier
Transform processors in VLSI on many different levels. The original idea of using
Charge Transfer Devices to implement the building blocks of these processors
was discarded when it became clear that other technologies such as CMOS and
NMOS had advantages in terms of power dissipation, speed, compactness, and
complexity of handling the necessary crossovers.

On an organizational level, the pipeline structure of Despain was shown to
map very well into a VLSI structure where it is desired to build as few different
chip types as possible while allowing the construction of processors which are
capable of computing transforms of arbitrary size. An extension of this
structure to allow higher-throughput, lower-latency processors to be built also
has the same desirable features due to the incredible flexibility which is
inherent in the DFT computation. This extension allows one to trade off speed
and cost in any way desired, especially for large transform sizes where the
communication requirements do not force one to use less than the maximum number
of transistors available on a chip.

The higher radix structures which were derived are to be preferred in
general since they require fewer full CORDIC rotators. This reduces latency and
hardware costs, especially where the operations needed to realize the higher
radix butterfly are simple and can utilize partial multiplier circuits.

Several  chips  which  could  be  used  to  build  processors  in  this  style  were 
designed.  The  butterfly  and  CORDIC  chips  could  be  used  to  put  together  arbi¬ 
trarily  large  radix  2  or  radix  4  (due  to  the  built-in  ^-rotator  of  the  butterfly) 

transforms.  The  bit-skewed  format  turned  out  to  have  unfortunate  properties 
for the CORDIC-style circuits, but worked well for the butterfly modules. The
real  and  imaginary  multiplexing  was  necessary  at  the  time  due  to  pin  limita¬ 
tions,  but  also  avoided  crossovers  internal  to  the  chips.  The  -7^- rotator  was  to 

ID 

be  used  along  with  a  5- rotator  in  the  construction  of  a  16-point  DFT  processor, 
□ 

but  improvements  in  circuit  density  allowed  us  to  build  the  more  general 
CORDIC  circuit  which  could  replace  them  both.  Also,  various  experimental  chips 
such  as  the  root-3  partial  multiplier  and  the  4  bit-slice  butterfly  which  were 
fabricated  early  on  verified  the  ease  of  mapping  the  algorithms  into  silicon.  The 
barrel  shifter  module  was  successful  and  could  be  used  in  a  less  expensive,  lower 
performance,  iterative  CORDIC  design. 

Regular,  pipeline  organizations  which  drew  on  the  work  of  Winograd  to 
reduce  multiplies  in  DFT  computations  were  developed.  These  structures  are 
quite  compact,  but  would  require  the  development  of  many  different  chip  types 
if  very  large  VLSI  processors  were  to  be  built. 

The problem of latency, already addressed from an organizational
viewpoint, has also been attacked by the development of new carry lookahead
circuits which are faster and cheaper than any others which have been seen in
the published literature. The new circuits also allow one to realize the inherent
speed advantages that exist in high fan-in over low fan-in gates.

Finally, new work in complexity theory using a model which takes into
account the special characteristics of VLSI has allowed a comparison of different
styles of DFT processor design free from the details of a specific
implementation.